Abstract: This paper considers the classification of linear 
subspaces with mismatched classifiers. In particular, we assume 
a model where one observes signals in the presence of isotropic 
Gaussian noise and the distribution of the signals conditioned 
on a given class is Gaussian with a zero mean and a low-rank 
covariance matrix. We also assume that the classifier knows only 
a mismatched version of the parameters of input distribution 
in lieu of the true parameters. By constructing an asymptotic 
low-noise expansion of an upper bound to the error probability 
of such a mismatched classifier, we provide sufficient conditions 
for reliable classification in the low-noise regime that are able to 
sharply predict the absence of a classification error floor. Such 
conditions are a function of the geometry of the true signal 
distribution, the geometry of the mismatched signal distributions 
as well as the interplay between such geometries, namely, the 
principal angles and the overlap between the true and the 
mismatched signal subspaces. Numerical results demonstrate that 
our conditions for reliable classification can sharply predict the 
behavior of a mismatched classifier, both with synthetic data and 
in motion segmentation and hand-written digit classification 
applications. 

Index Terms: Classification, mismatch, linear subspace, 
Maximum-a-Posteriori classifier, error floor. 

I. Introduction 

Signal classification is a fundamental task in various fields, 
including statistics, machine learning and computer vision. 
One often approaches this problem by leveraging the Bayesian 
inference paradigm, where one infers the signal class from 
signal samples or measurements based on a model of the joint 
distribution of the signal and signal classes [1, Chapter 2]. 

Such joint distribution is typically inferred by relying on 
pre-labeled data sets. However, in practical applications, the 
methods used to estimate the distributions from training data 
inevitably lead to signal models that are not perfectly matched 
to the underlying one. This can be due to an insufficient 
amount of labeled data, noise in the pre-labeled data [2], 
or non-stationary statistical behaviour [5]. 

It is therefore relevant to ask the question: 

What is the impact on classification performance of a 
mismatched classifier, i.e. a classifier that infers the signal 
classes based on an inaccurate model of the data distribution 
in lieu of the true underlying data distribution? 

We answer this question for the scenario where the 
data classes are constrained to lie approximately on a 
low-dimensional linear subspace embedded in the high-dimensional 
ambient space. Indeed, there are various problems 
in signal processing, image processing and computer vision 
that conform to such a model, some of which are: 

• Face Recognition: It can be shown that, provided that 
the Lambertian reflectance assumption is verified, the 
set of images taken from the same subject under different 
lighting conditions can be well approximated by a 
low-dimensional linear subspace embedded in the 
high-dimensional space [6]. This is leveraged in several face 
recognition applications [7]-[9]. 

• Motion Segmentation: It can also be shown, under the 
assumption of the affine projection camera model, that 
the coordinates of feature points associated with rigidly 
moving objects through different video frames lie in 
a 4-dimensional linear space [10], [11], [12]. This is 
leveraged in [10] to design subspace clustering algorithms 
that can perform motion segmentation. 

• In general, (affine) subspaces or unions of (affine) subspaces 
can also be used to model other data such as 
images of handwritten digits [13]. 

Our contributions include: 

• We derive an upper bound to the error probability associated 
with the mismatched classifier for the case where 
the distribution of the signal in a given class is Gaussian 
with zero mean and low-rank covariance matrix. 

• We then derive sufficient conditions for reliable classification 
in the asymptotic low-noise regime. Such conditions 
are expressed in terms of the geometry of the true signal 
model, the geometry of the mismatched signal model 
and the interaction of these geometries (via the principal 
angles associated with the subspaces of the true and 
mismatched signal models as well as the dimension of 
the intersection of such subspaces). 

• We finally provide a number of results, both with synthetic 
and real data, that show that our sufficient conditions 
for reliable classification are sharp. In particular, 
we also use our theoretical framework to determine the 
number of training samples needed to achieve reliable 
classification in motion segmentation and hand-written 
digit classification applications. 
A. Related Work 


The concept of model mismatch has been widely explored 
by the information theory and communication theory 
communities. For example, in lossless source coding 
problems, mismatch between the distribution used to encode 
the source and the true distribution is shown to lead 
to a compression rate penalty which is determined by the 
Kullback-Leibler (KL) distance between the mismatched and 
the true distributions [14, Theorem 5.4.3]. 

In channel coding problems, mismatch has an impact on the 
reliable information transmission rate that has been characterized 
via inner and outer bounds to the achievable rate and error 
exponents of different channel models [15]-[19]. The problem 
of mismatched quantization is considered in [20]. 

The concept of mismatch has also been explored in the 
machine learning literature [5]. In particular, [5] studies the 
impact on classification performance of training sets consisting 
of biased samples of the true distribution, expressing classification 
error bounds as a function of the sample bias severity 
and type. The effect of label noise in the training sets is 
also considered in classification algorithms such as Support 
Vector Machines [3] and Logistic Regression classifiers [4]. 
See also [2] for an overview of the literature on classification 
in the presence of label noise. 

Signal classification and estimation using mismatched models 
is also considered in [21]-[24]. For example, [23] expresses 
bounds to the error probability in the presence of mismatch via 
the f-divergence between the true and mismatched source 
distributions, and [24] expresses the mean-squared error penalty 
in the presence of mismatch in terms of the derivative of the KL 
distance between the true and the mismatched distributions 
with respect to the decoder signal to noise ratio (SNR). In 
particular, the work in [23] is closely related to our work in 
the sense that it also establishes bounds to the error probability 
in the presence of mismatch. The bounds presented in [23] are 
more general since they do not assume a particular form of 
probability density functions. Our work, on the other hand, 
leverages the assumption that signals are contained in linear 
subspaces in order to derive an upper bound that sharply 
predicts the presence or absence of an error floor. The bounds 
in [23] fail to capture the presence or absence of an error floor 
when specialized to the proposed signal model. 


B. Organization 

The remainder of this paper is organized as follows: Section II 
introduces the observation and signal models, the 
Mismatched Maximum-a-Posteriori (MMAP) classifier and 
the geometrical quantities associated with the signal and the 
mismatched model that are essential for the description of 
the MMAP classifier performance. The upper bound to the 
error probability associated with the MMAP classifier and 
the asymptotic expansion, which provide sufficient conditions 
for reliable classification in the low-noise regime, are given 
in Section III. In Section IV the theoretical results are validated 
via numerical experiments. Applications of the proposed 
bound in a motion segmentation task and in a hand-written 
digit classification task are given in Section V. The paper is 
concluded in Section VI. The proofs of the results are given 
in the Appendix. 


C. Notation 

We use the following notation in the sequel: matrices, 
column vectors and scalars are denoted by boldface upper-case 
letters ($\mathbf{X}$), boldface lower-case letters ($\mathbf{x}$) and italic letters 
($x$), respectively. $\mathbf{I}_N \in \mathbb{R}^{N \times N}$ denotes the identity matrix and 
$\mathbf{0}_{M \times N} \in \mathbb{R}^{M \times N}$ denotes the zero matrix. The subscripts are 
omitted when the dimensions are clear from the context. $\mathbf{e}_k$ 
denotes the $k$-th basis vector in $\mathbb{R}^N$. The transpose, rank and 
determinant operators are denoted as $(\cdot)^T$, $\operatorname{rank}(\cdot)$ and $|\cdot|$, 
respectively. $\|\mathbf{x}\|$ denotes the Euclidean norm of the vector $\mathbf{x}$ and 
$\|\mathbf{X}\|_2$ denotes the spectral norm of the matrix $\mathbf{X}$. The 
image of a matrix is denoted by $\operatorname{im}(\cdot)$ and the kernel of a 
matrix is denoted by $\ker(\cdot)$. The sum of subspaces $\mathcal{A}$ and 
$\mathcal{B}$ is denoted as $\mathcal{A} + \mathcal{B}$ and the orthogonal complement of 
$\mathcal{A}$ is denoted as $\mathcal{A}^{\perp}$. $\log(\cdot)$ denotes the natural logarithm, 
and the multivariate Gaussian distribution with mean $\boldsymbol{\mu}$ 
and covariance matrix $\boldsymbol{\Sigma}$ is denoted as $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. We also 
use the following asymptotic notation: $f(x) = O(g(x))$ if 
$\lim_{x \to \infty} f(x)/g(x) = c$, where $c > 0$, and $f(x) = o(g(x))$ if 
$\lim_{x \to \infty} f(x)/g(x) = 0$. 

II. Problem Statement 

We consider a standard observation model: 

$$\mathbf{y} = \mathbf{x} + \mathbf{n} \tag{1}$$

where $\mathbf{y} \in \mathbb{R}^N$ represents the observation vector, $\mathbf{x} \in \mathbb{R}^N$ 
represents the signal vector and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \in \mathbb{R}^N$ represents 
observation noise, where $\sigma^2$ denotes the noise variance 
per dimension.¹ We also assume that the signal $\mathbf{x} \in \mathbb{R}^N$ is 
drawn from a class $c \in \{1, \dots, C\}$ with prior probability 
$P(c = i) = p_i$, and that the distribution of the signal $\mathbf{x}$ 
conditioned on a given class $c = i$ is Gaussian with mean zero 
and (possibly) low-rank covariance matrix $\boldsymbol{\Sigma}_i \in \mathbb{R}^{N \times N}$, i.e. 

$$\mathbf{x} \mid c = i \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_i), \tag{2}$$

with $\operatorname{rank}(\boldsymbol{\Sigma}_i) = r_i < N$. Therefore, conditioned on a given 
class $c = i$, the signal lies on the linear subspace spanned by 
the eigenvectors associated with the positive eigenvalues of 
the covariance matrix $\boldsymbol{\Sigma}_i$. 
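
As a minimal sketch of the observation model (1)-(2), the following Python/NumPy snippet draws one signal from a zero-mean Gaussian class with low-rank covariance and adds isotropic noise. The function name `sample_observation`, the rank-2 subspace, the eigenvalues and the noise level are illustrative assumptions and are not taken from the experiments of this paper.

```python
import numpy as np

def sample_observation(U, eigvals, sigma2, rng):
    """Draw y = x + n with x ~ N(0, U diag(eigvals) U^T) and n ~ N(0, sigma2 * I)."""
    N, r = U.shape
    a = rng.standard_normal(r) * np.sqrt(eigvals)   # coordinates of x in the basis U
    x = U @ a                                       # the signal lies in im(U)
    n = np.sqrt(sigma2) * rng.standard_normal(N)    # isotropic observation noise
    return x + n, x

rng = np.random.default_rng(0)
# Hypothetical rank-2 class in R^4: signal subspace spanned by e_1 and e_2.
U = np.eye(4)[:, :2]
y, x = sample_observation(U, np.array([1.0, 0.5]), sigma2=1e-4, rng=rng)
```

The signal `x` has no component outside im(U), whereas the observation `y` leaves the subspace only through the noise term.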

¹ This noise vector can also model the fact that data does not always lie 
exactly on a low-dimensional subspace but rather approximately on a 
low-dimensional subspace [13]. 

The classification problem involves inferring the correct 
class label $c$ associated with the signal $\mathbf{x}$ from the signal 
observation $\mathbf{y}$. It is well known that the optimal classification 
rule, which minimizes the error probability, is given by the 
Maximum-A-Posteriori (MAP) classifier [1, Chapter 2.3]: 

$$\hat{c} = \arg\max_{i \in \{1,\dots,C\}} p(c = i \mid \mathbf{y}) = \arg\max_{i \in \{1,\dots,C\}} p(\mathbf{y} \mid c = i)\, p_i, \tag{3}$$

where $p(c = i \mid \mathbf{y})$ represents the a posteriori probability of 
class label $c = i$ given the observation $\mathbf{y}$ and 

$$p(\mathbf{y} \mid c = i) = \frac{1}{\sqrt{(2\pi)^N |\boldsymbol{\Sigma}_i + \sigma^2 \mathbf{I}|}} \exp\left(-\tfrac{1}{2}\, \mathbf{y}^T (\boldsymbol{\Sigma}_i + \sigma^2 \mathbf{I})^{-1} \mathbf{y}\right) \tag{4}$$





TABLE II 

Relationships between the quantities used in the analysis 
Fig. 1. System Model 
represents the probability density function of the observation 
$\mathbf{y}$ given the class label $c = i$. 

However, we assume that the classifier does not have access 
to the true signal parameters $p_i$, $i = 1, \dots, C$ and $\boldsymbol{\Sigma}_i$, 
$i = 1, \dots, C$ but rather to a set of mismatched parameters $\tilde{p}_i$, 
$i = 1, \dots, C$ and $\tilde{\boldsymbol{\Sigma}}_i$, $i = 1, \dots, C$, where $\tilde{p}_i$ is the mismatched a 
priori probability of the $i$-th class and $\tilde{\boldsymbol{\Sigma}}_i$ is the mismatched 
covariance matrix associated with class $i$, with 
$\operatorname{rank}(\tilde{\boldsymbol{\Sigma}}_i) = \tilde{r}_i < N$ (see Fig. 1).² 

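Such a Mismatched-MAP (MMAP) classifier delivers the 
class estimate 

$$\hat{c} = \arg\max_{i \in \{1,\dots,C\}} \tilde{p}(c = i \mid \mathbf{y}) = \arg\max_{i \in \{1,\dots,C\}} \tilde{p}(\mathbf{y} \mid c = i)\, \tilde{p}_i, \tag{5}$$

where $\tilde{p}(c = i \mid \mathbf{y})$ denotes the mismatched a posteriori 
probability of class label $c = i$ given observation $\mathbf{y}$ and 

$$\tilde{p}(\mathbf{y} \mid c = i) = \frac{1}{\sqrt{(2\pi)^N |\tilde{\boldsymbol{\Sigma}}_i + \sigma^2 \mathbf{I}|}} \exp\left(-\tfrac{1}{2}\, \mathbf{y}^T (\tilde{\boldsymbol{\Sigma}}_i + \sigma^2 \mathbf{I})^{-1} \mathbf{y}\right) \tag{6}$$

denotes the mismatched probability density function of the 
observation $\mathbf{y}$ given the class label $c = i$. 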
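
The MMAP rule can be sketched in a few lines of Python/NumPy by comparing mismatched log-posteriors, i.e. the log of a zero-mean Gaussian density with covariance equal to the mismatched class covariance plus $\sigma^2 \mathbf{I}$, plus the log of the mismatched prior. The function name `mmap_classify`, the 0-based class labels and the two toy diagonal covariances are assumptions made for illustration only.

```python
import numpy as np

def mmap_classify(y, covs_mm, priors_mm, sigma2):
    """Return argmax_i of log p~(y | c=i) + log p~_i for zero-mean Gaussian classes."""
    N = y.shape[0]
    scores = []
    for S, p in zip(covs_mm, priors_mm):
        C = S + sigma2 * np.eye(N)                  # mismatched covariance of y given class i
        _, logdet = np.linalg.slogdet(C)
        quad = y @ np.linalg.solve(C, y)            # y^T C^{-1} y
        scores.append(np.log(p) - 0.5 * (N * np.log(2 * np.pi) + logdet + quad))
    return int(np.argmax(scores))

# Hypothetical mismatched model with two classes in R^4 (labels 0 and 1).
covs_mm = [np.diag([1.0, 1.0, 0.0, 0.0]), np.diag([0.0, 1.0, 1.0, 0.0])]
label_a = mmap_classify(np.array([1.0, 0.0, 0.0, 0.0]), covs_mm, [0.5, 0.5], sigma2=1e-4)
label_b = mmap_classify(np.array([0.0, 0.0, 1.0, 0.0]), covs_mm, [0.5, 0.5], sigma2=1e-4)
```

Working with log-densities and `slogdet` avoids underflow when $\sigma^2$ is small, which is the regime of interest here.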

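The probability of error associated with an MMAP classifier 
is given by: 

$$P(e) = \sum_{i=1}^{C} p_i\, P(e \mid c = i) \tag{7}$$

where 

$$P(e \mid c = i) = \int p(\mathbf{y} \mid c = i)\; u\!\left( \max_{j \neq i} \log \frac{\tilde{p}_j\, \tilde{p}(\mathbf{y} \mid c = j)}{\tilde{p}_i\, \tilde{p}(\mathbf{y} \mid c = i)} \right) d\mathbf{y} \tag{8}$$

and $u(\cdot)$ is the unit-step function. This error probability cannot 
be calculated in closed form, but it can be easily bounded. 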

Our goal is to study the performance of the MMAP classifier 
by establishing conditions, which are a function of the 
geometry of the true and mismatched signal models as well as 
the interaction of such geometries, for reliable classification in 
the low-noise regime, i.e. such that $\lim_{\sigma^2 \to 0} P(e) = 0$. 


A. Geometrical Description of the Signals 

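Our characterization of the performance of the MMAP 
classifier will be expressed via various quantities that embody 
the geometry of the true signal model, the geometry of the 
mismatched signal model, and their interplay. The quantities 
central to the analysis are given in Table I and the relationships 
between the presented quantities are summarized in Table II. 

Table II, relationships 1) to 8): 

1) $\operatorname{im}(\tilde{\mathbf{U}}_i) + \operatorname{im}(\tilde{\mathbf{U}}_j) = \operatorname{im}(\tilde{\mathbf{U}}'_{ij}) + \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij}) + \operatorname{im}(\tilde{\mathbf{U}}'_{ji})$ 

2) $\operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij}) = \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ji}) = \operatorname{im}(\tilde{\mathbf{U}}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}_j)$ 

3) $\operatorname{im}(\tilde{\mathbf{U}}'_{ij}) = \operatorname{im}(\tilde{\mathbf{U}}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij})^{\perp}$ 

4) $\operatorname{im}(\tilde{\mathbf{U}}'_{ji}) = \operatorname{im}(\tilde{\mathbf{U}}_j) \cap \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij})^{\perp}$ 

5) $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp} = \operatorname{im}(\tilde{\mathbf{U}}_i)^{\perp} + \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij})$ 

6) $\operatorname{im}(\mathbf{U}_i) = \operatorname{im}(\mathbf{W}_{ij}) + \operatorname{im}(\mathbf{V}_{ij})$ 

7) $\operatorname{im}(\mathbf{W}_{ij}) = \operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp}$ 

8) $\operatorname{im}(\mathbf{V}_{ij}) = \operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\mathbf{W}_{ij})^{\perp}$ 

² We assume that $C$ and $\sigma^2$ are known. Since we study the scenario where 
$\sigma^2 \to 0$, the assumption that $\sigma^2$ is known exactly is immaterial. 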

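1) Quantities associated with the geometry of the true 
signal model or the mismatched signal model: The signal 
space corresponding to class $i$ and the mismatched signal 
space corresponding to class $i$, which are subspaces of $\mathbb{R}^N$, 
are denoted as $\operatorname{im}(\boldsymbol{\Sigma}_i)$ and $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i)$, respectively. An orthonormal 
basis for $\operatorname{im}(\boldsymbol{\Sigma}_i)$ is denoted as $\mathbf{U}_i \in \mathbb{R}^{N \times r_i}$ and an 
orthonormal basis for $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i)$ is denoted as $\tilde{\mathbf{U}}_i \in \mathbb{R}^{N \times \tilde{r}_i}$; 
these quantities follow directly from the truncated eigenvalue 
decompositions $\boldsymbol{\Sigma}_i = \mathbf{U}_i \boldsymbol{\Lambda}_i \mathbf{U}_i^T$ and $\tilde{\boldsymbol{\Sigma}}_i = \tilde{\mathbf{U}}_i \tilde{\boldsymbol{\Lambda}}_i \tilde{\mathbf{U}}_i^T$, 
where $\boldsymbol{\Lambda}_i = \operatorname{diag}(\lambda^i_1, \lambda^i_2, \dots, \lambda^i_{r_i}) \in \mathbb{R}^{r_i \times r_i}$ and 
$\tilde{\boldsymbol{\Lambda}}_i = \operatorname{diag}(\tilde{\lambda}^i_1, \tilde{\lambda}^i_2, \dots, \tilde{\lambda}^i_{\tilde{r}_i}) \in \mathbb{R}^{\tilde{r}_i \times \tilde{r}_i}$ are diagonal matrices containing 
the positive eigenvalues of $\boldsymbol{\Sigma}_i$ and $\tilde{\boldsymbol{\Sigma}}_i$, respectively. Note 
that $\operatorname{im}(\boldsymbol{\Sigma}_i) = \operatorname{im}(\mathbf{U}_i)$ and $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) = \operatorname{im}(\tilde{\mathbf{U}}_i)$. 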

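2) Quantities associated with the interplay between the 
geometry of the mismatched signal models: We consider 
quantities that reveal the relationship between the mismatched 
signal spaces of classes $i$ and $j$. In particular, such quantities 
follow from the decomposition of the subspace 
$\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i + \tilde{\boldsymbol{\Sigma}}_j) = \operatorname{im}(\tilde{\mathbf{U}}_i) + \operatorname{im}(\tilde{\mathbf{U}}_j)$, which spans the mismatched signal 
subspaces of classes $i$ and $j$, given by: 

$$\operatorname{im}(\tilde{\mathbf{U}}_i) + \operatorname{im}(\tilde{\mathbf{U}}_j) = \operatorname{im}(\tilde{\mathbf{U}}'_{ij}) + \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij}) + \operatorname{im}(\tilde{\mathbf{U}}'_{ji}),$$

with $\operatorname{im}(\tilde{\mathbf{U}}'_{ij}) + \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij}) = \operatorname{im}(\tilde{\mathbf{U}}_i) = \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i)$ and 
$\operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij}) + \operatorname{im}(\tilde{\mathbf{U}}'_{ji}) = \operatorname{im}(\tilde{\mathbf{U}}_j) = \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$, where 

• $\tilde{\mathbf{U}}^{\cap}_{ij} \in \mathbb{R}^{N \times \tilde{r}^{\cap}_{ij}}$ represents an orthonormal basis for the 
intersection $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) \cap \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$ and $\tilde{r}^{\cap}_{ij}$ is the dimension 
of $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) \cap \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$. This intersection is associated with 
class $i$ as well as class $j$; 

• $\tilde{\mathbf{U}}'_{ij} \in \mathbb{R}^{N \times \tilde{r}'_{ij}}$ represents an orthonormal basis for the 
orthogonal complement of $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) \cap \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$ in $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i)$ 
and $\tilde{r}'_{ij}$ is the codimension of $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) \cap \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$ in 
$\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i)$. $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$ can be interpreted as the subspace of 
the mismatched signal space corresponding to class $i$ that 
is only associated with class $i$ and not with class $j$; 

• $\tilde{\mathbf{U}}'_{ji} \in \mathbb{R}^{N \times \tilde{r}'_{ji}}$ represents an orthonormal basis for the 
orthogonal complement of $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) \cap \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$ in $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$ 
and $\tilde{r}'_{ji}$ is the codimension of $\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_i) \cap \operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$ in 
$\operatorname{im}(\tilde{\boldsymbol{\Sigma}}_j)$. $\operatorname{im}(\tilde{\mathbf{U}}'_{ji})$ can be interpreted as the subspace of 
the mismatched signal space corresponding to class $j$ that 
is only associated with class $j$ and not with class $i$. 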















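TABLE I 

Main quantities used in the analysis 

Subspace | Dimension | Description 
$\operatorname{im}(\mathbf{U}_i)$ | $r_i$ | signal space of class $i$ 
$\operatorname{im}(\tilde{\mathbf{U}}_i)$ | $\tilde{r}_i$ | mismatched signal space of class $i$ 
$\operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{ij})$ | $\tilde{r}^{\cap}_{ij}$ | intersection of the mismatched signal spaces of classes $i$ and $j$ 
$\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$ | $\tilde{r}'_{ij}$ | subspace of mismatched signal space of class $i$ that is not associated with the mismatched signal space of class $j$ 
$\operatorname{im}(\tilde{\mathbf{U}}'_{ji})$ | $\tilde{r}'_{ji}$ | subspace of mismatched signal space of class $j$ that is not associated with the mismatched signal space of class $i$ 
$\operatorname{im}(\mathbf{W}_{ij})$ | $s^W_{ij}$ | subspace of signal space of class $i$ that is orthogonal to $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$ 
$\operatorname{im}(\mathbf{V}_{ij})$ | $s^V_{ij}$ | subspace of signal space of class $i$ that is not orthogonal to $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$, i.e. it complements $\operatorname{im}(\mathbf{W}_{ij})$ in $\operatorname{im}(\mathbf{U}_i)$ 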


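Note that $\tilde{\mathbf{U}}^{\cap}_{ij}$ together with $\tilde{\mathbf{U}}'_{ij}$ and $\tilde{\mathbf{U}}'_{ji}$ complete the 
basis for $\operatorname{im}(\tilde{\mathbf{U}}_i)$ and $\operatorname{im}(\tilde{\mathbf{U}}_j)$, respectively, i.e. 
$\operatorname{im}(\tilde{\mathbf{U}}_i) = \operatorname{im}([\tilde{\mathbf{U}}^{\cap}_{ij}\ \tilde{\mathbf{U}}'_{ij}])$ and $\operatorname{im}(\tilde{\mathbf{U}}_j) = \operatorname{im}([\tilde{\mathbf{U}}^{\cap}_{ij}\ \tilde{\mathbf{U}}'_{ji}])$. 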

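3) Quantities associated with the interplay between the 
geometry of the true signal model and the mismatched signal 
model: We also consider quantities that capture the interaction 
between the signal space corresponding to class $i$ and the 
mismatched signal spaces of classes $i$ and $j$. Such quantities 
are given by the decomposition of $\operatorname{im}(\boldsymbol{\Sigma}_i) = \operatorname{im}(\mathbf{U}_i)$: 

$$\operatorname{im}(\mathbf{U}_i) = \operatorname{im}(\mathbf{W}_{ij}) + \operatorname{im}(\mathbf{V}_{ij})$$

where 

• $\operatorname{im}(\mathbf{W}_{ij}) = \operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp}$ and $\mathbf{W}_{ij} \in \mathbb{R}^{N \times s^W_{ij}}$ 
represents an orthonormal basis for $\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp}$, 
where $s^W_{ij} = \dim(\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp})$. $\operatorname{im}(\mathbf{W}_{ij})$ can be 
interpreted as the subspace of the signal space corresponding 
to class $i$ that is orthogonal to $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$; 

• $\operatorname{im}(\mathbf{V}_{ij}) = \operatorname{im}(\mathbf{U}_i) \cap (\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp})^{\perp}$ and 
$\mathbf{V}_{ij} \in \mathbb{R}^{N \times s^V_{ij}}$ represents an orthonormal basis for 
the orthogonal complement of $\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp}$ in 
$\operatorname{im}(\mathbf{U}_i)$, where $s^V_{ij} = r_i - \dim(\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp})$; then, 
$s^V_{ij} = \dim(\operatorname{im}(\mathbf{V}_{ij})) = \operatorname{rank}(\mathbf{V}_{ij})$ is the codimension of 
$\operatorname{im}(\mathbf{W}_{ij})$ in $\operatorname{im}(\mathbf{U}_i)$. $\operatorname{im}(\mathbf{V}_{ij})$ can be interpreted as the 
subspace of the signal space of class $i$ that is not orthogonal 
to $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$, i.e. it complements $\operatorname{im}(\mathbf{W}_{ij})$ in $\operatorname{im}(\mathbf{U}_i)$. 

Note that $\operatorname{im}([\mathbf{V}_{ij}\ \mathbf{W}_{ij}]) = \operatorname{im}(\mathbf{U}_i)$. 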

4) Principal angles and distance between subspaces: 
Finally, our results will also be expressed via the principal 
angles between certain subspaces. In particular, consider a 
subspace $\mathcal{Y}$ with an orthonormal basis $\mathbf{Y} \in \mathbb{R}^{N \times y}$, where 
$y = \dim(\mathcal{Y})$, and a subspace $\mathcal{Z}$ with an orthonormal basis 
$\mathbf{Z} \in \mathbb{R}^{N \times z}$, where $z = \dim(\mathcal{Z})$, and define $k = \min(y, z)$. 
Then the principal angles $0 \le \theta_1 \le \dots \le \theta_k \le \pi/2$ between $\mathcal{Y}$ 
and $\mathcal{Z}$ are given by the singular value decomposition (SVD): 

$$\mathbf{Y}^T \mathbf{Z} = \mathbf{H} \mathbf{D} \mathbf{J}^T \tag{9}$$

where $\mathbf{H} \in \mathbb{R}^{y \times y}$ and $\mathbf{J} \in \mathbb{R}^{z \times z}$ are orthonormal matrices 
and $\mathbf{D} \in \mathbb{R}^{y \times z}$ is a rectangular diagonal matrix containing 
the singular values $1 \ge d_1 \ge \dots \ge d_k \ge 0$. Each singular 
value $d_i$ corresponds to the cosine of the principal angle $\theta_i$ 
between $\mathcal{Y}$ and $\mathcal{Z}$, i.e., $d_i = \cos(\theta_i)$ [25, Chapter 8.7]. 
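
The computation in (9) maps directly onto a singular value decomposition routine. The following Python/NumPy sketch returns the principal angles between two subspaces given orthonormal bases; the function name `principal_angles` and the toy subspaces of $\mathbb{R}^3$ are hypothetical choices made for illustration.

```python
import numpy as np

def principal_angles(Y, Z):
    """Principal angles (ascending, in radians) between im(Y) and im(Z).

    Y and Z are assumed to have orthonormal columns.
    """
    d = np.linalg.svd(Y.T @ Z, compute_uv=False)    # singular values = cosines, descending
    return np.arccos(np.clip(d, 0.0, 1.0))          # clip guards against round-off above 1

# Hypothetical subspaces of R^3: the (e_1, e_2) plane and a line tilted 30 degrees out of it.
Y = np.eye(3)[:, :2]
Z = np.array([[np.cos(np.pi / 6)], [0.0], [np.sin(np.pi / 6)]])
theta = principal_angles(Y, Z)
```

Because the singular values are returned in descending order, the angles come out in ascending order, so the first entry is the smallest principal angle and the last entry is the largest.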

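The principal angles are used to define various distances on 
a Grassmann manifold [26]. We will be predominantly using 
the max correlation distance between two subspaces 

$$d_{\max}(\mathcal{Y}, \mathcal{Z}) = d_{\max}(\mathbf{Y}, \mathbf{Z}) = \sin \theta_1 \tag{10}$$

which is a function of the smallest principal angle $\theta_1$, and the 
min correlation distance between two subspaces 

$$d_{\min}(\mathcal{Y}, \mathcal{Z}) = d_{\min}(\mathbf{Y}, \mathbf{Z}) = \sin \theta_k \tag{11}$$

which is a function of the largest principal angle $\theta_k$ between 
the two subspaces. Note that we slightly abuse the notation in 
the second term of (10) and (11), as $\mathbf{Y}$ and $\mathbf{Z}$ are bases for 
the subspaces, not subspaces. 

5) Interpretation: It is instructive to cast some insight on 
the role of these various quantities in the characterization of 
the performance of the MMAP classifier. 

Consider a two-class classification problem that involves 
distinguishing class 1 from class 2 in the low-noise regime 
(so $\mathbf{y} \approx \mathbf{x}$). It is clear that the MMAP classifier will associate 
an observation $\mathbf{y} \in \operatorname{im}(\tilde{\mathbf{U}}'_{12})$ with class 1 and an observation 
$\mathbf{y} \in \operatorname{im}(\tilde{\mathbf{U}}'_{21})$ with class 2; in turn, the MMAP classifier may 
associate an observation $\mathbf{y} \in \operatorname{im}(\tilde{\mathbf{U}}^{\cap}_{12})$ either with class 1 or 
2. In general, the observation associated with class 1 is such 
that $\mathbf{y} \in \operatorname{im}(\mathbf{U}_1) = \operatorname{im}(\mathbf{V}_{12}) + \operatorname{im}(\mathbf{W}_{12})$. 

The following example demonstrates the classification of 
$\mathbf{y} \mid c = 1$ by the MMAP classifier where the covariance matrices 
are assumed to be diagonal. 

Example 1: We take the covariance matrices to be 

$\boldsymbol{\Sigma}_1 = \operatorname{diag}(1, 1, 1, 0)$, $\boldsymbol{\Sigma}_2 = \operatorname{diag}(0, 1, 1, 1)$ 

$\tilde{\boldsymbol{\Sigma}}_1 = \operatorname{diag}(1, 1, 0, 0)$, $\tilde{\boldsymbol{\Sigma}}_2 = \operatorname{diag}(0, 1, 1, 0)$. 

The relevant quantities (see Table I) are given as: 

$\mathbf{U}_1 = [\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3]$, $\mathbf{U}_2 = [\mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_4]$ 

$\tilde{\mathbf{U}}_1 = [\mathbf{e}_1, \mathbf{e}_2]$, $\tilde{\mathbf{U}}_2 = [\mathbf{e}_2, \mathbf{e}_3]$, 

and 

$\tilde{\mathbf{U}}'_{12} = \mathbf{e}_1$, $\tilde{\mathbf{U}}^{\cap}_{12} = \mathbf{e}_2$, $\tilde{\mathbf{U}}'_{21} = \mathbf{e}_3$. 

We also determine $\operatorname{im}(\mathbf{W}_{12})$ and $\operatorname{im}(\mathbf{V}_{12})$: 

$\operatorname{im}(\mathbf{W}_{12}) = \operatorname{im}(\mathbf{U}_1) \cap \operatorname{im}(\tilde{\mathbf{U}}'_{12})^{\perp} = \operatorname{im}([\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3]) \cap \operatorname{im}([\mathbf{e}_2, \mathbf{e}_3, \mathbf{e}_4]) = \operatorname{im}([\mathbf{e}_2, \mathbf{e}_3])$ 

$\operatorname{im}(\mathbf{V}_{12}) = \operatorname{im}(\mathbf{U}_1) \cap \operatorname{im}(\mathbf{W}_{12})^{\perp} = \operatorname{im}(\mathbf{e}_1)$. 

Assume now that $\mathbf{y} \in \operatorname{im}(\mathbf{V}_{12})$ and note that $\operatorname{im}(\mathbf{V}_{12}) = 
\operatorname{im}(\tilde{\mathbf{U}}'_{12})$. Therefore, $\mathbf{y} \in \operatorname{im}(\mathbf{V}_{12})$ will be classified as class 
1 by the MMAP classifier. In contrast, assume now that 
$\mathbf{y} \in \operatorname{im}(\mathbf{W}_{12})$ and note that $\operatorname{im}(\mathbf{W}_{12})$ contains $\operatorname{im}(\tilde{\mathbf{U}}'_{21})$. 
Therefore $\mathbf{y} \in \operatorname{im}(\mathbf{W}_{12})$ may be classified as class 2. 

Next, we modify the mismatched model of class 2 as 

$\tilde{\boldsymbol{\Sigma}}_2 = \operatorname{diag}(0, 1, 0, 1)$, 

which leads to $\tilde{\mathbf{U}}'_{21} = \mathbf{e}_4$. Note now that $\operatorname{im}(\mathbf{W}_{12})$ does not 
contain $\operatorname{im}(\tilde{\mathbf{U}}'_{21})$ and $\mathbf{y} \in \operatorname{im}(\mathbf{W}_{12})$ will not be associated 
uniquely with class 2 by the MMAP classifier. 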

It is now clear that the relationship between subspaces 
$\operatorname{im}(\mathbf{W}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{21})$ will play a role in the characterization 
of conditions for perfect classification in the low-noise regime. 
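
For diagonal covariance matrices every subspace involved is spanned by coordinate vectors, so the quantities of Example 1 reduce to set operations on coordinate indices. The Python sketch below reproduces them under that assumption; the function name `diagonal_geometry` and the dictionary keys are hypothetical.

```python
def diagonal_geometry(S_i, St_i, St_j):
    """Coordinate-subspace version of the quantities in Table I.

    Each argument is the set of coordinate indices with a positive diagonal entry:
    S_i for the true class i, St_i and St_j for the mismatched classes i and j.
    """
    only_i = St_i - St_j            # im(U~'_ij): mismatched directions unique to class i
    only_j = St_j - St_i            # im(U~'_ji): mismatched directions unique to class j
    W = S_i - only_i                # im(W_ij): part of im(U_i) orthogonal to im(U~'_ij)
    V = S_i & only_i                # im(V_ij): complement of im(W_ij) in im(U_i)
    return {"only_i": only_i, "only_j": only_j, "W": W, "V": V}

# Example 1 (1-based coordinate indices), classes i = 1 and j = 2.
g = diagonal_geometry({1, 2, 3}, {1, 2}, {2, 3})
# Same true model, with the modified mismatched covariance diag(0, 1, 0, 1) for class 2.
g_mod = diagonal_geometry({1, 2, 3}, {1, 2}, {2, 4})
```

In the first case the index set of im(W_12) contains that of the subspace unique to mismatched class 2, whereas in the modified case the two sets are disjoint, mirroring the discussion of the example.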

The next example demonstrates the role of principal angles 
in the conditions for perfect classification in the low-noise 
regime. 



(a) Example of wrong classification with the MMAP classifier 


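Example 2: We take the signal space bases as: 

$\mathbf{U}_1 = [0, 1]^T$, $\mathbf{U}_2 = [\cos(\cdot), \sin(\cdot)]^T$, 
$\tilde{\mathbf{U}}_1 = [\cos(5\pi/6), \sin(5\pi/6)]^T$, $\tilde{\mathbf{U}}_2 = \mathbf{U}_2$, 

where the angle defining $\mathbf{U}_2$ is illegible in the source. 

The relevant quantities (see Table I) are given as: 

$\tilde{\mathbf{U}}'_{12} = \tilde{\mathbf{U}}_1$, $\tilde{\mathbf{U}}^{\cap}_{12} = \{\mathbf{0}\}$, $\tilde{\mathbf{U}}'_{21} = \tilde{\mathbf{U}}_2$ 

and $\mathbf{W}_{12} = \{\mathbf{0}\}$, $\mathbf{V}_{12} = \mathbf{U}_1$. The geometry of the signals 
and decision regions is presented in Fig. 2 (a). Note now that 
$\mathbf{y} \mid c = 1 \in \operatorname{im}(\mathbf{U}_1)$ can potentially be associated with the correct 
class 1 depending on the distance (computed according to 
an appropriate metric) between $\operatorname{im}(\mathbf{V}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{12})$ and 
the distance between $\operatorname{im}(\mathbf{V}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{21})$. In particular, 
the angle between $\operatorname{im}(\mathbf{V}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{12})$ is greater than 
the angle between $\operatorname{im}(\mathbf{V}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{21})$, which leads to 
misclassification of signals from class 1. On the other hand, 
if we take 

$\tilde{\mathbf{U}}_1 = [\cos(\cdot), \sin(\cdot)]^T$ (angle likewise illegible in the source), 

the angle between $\operatorname{im}(\mathbf{V}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{12})$ is smaller than the 
angle between $\operatorname{im}(\mathbf{V}_{12})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{21})$, which leads to perfect 
classification of signals from class 1 in the low-noise regime. 
This case is presented in Fig. 2 (b). 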


The ensuing analysis shows how these various quantities - 
which are readily computed from the underlying geometry of 
the true subspaces and the mismatched ones - can be used as 
a proxy to define sufficient conditions for perfect classification 
in the low-noise regime. In particular, these quantities bypass 
the need to compute the decision regions associated with the 
MMAP classifier in order to quantify the performance. 


III. Conditions for Reliable Classification 

We now consider (sufficient) conditions for reliable classification 
in the low-noise regime. We derive these conditions 
directly from a low-noise expansion of an upper bound to the 
error probability associated with the MMAP classifier. 

The following upper bound to the probability of error 
associated with a MMAP classifier will play a key role in 
the analysis. 

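Theorem 1: Set $\alpha_{ij} > 0$ $\forall (i, j)$, $i \neq j$. Set 

$$\boldsymbol{\Sigma}_{ij} = (\boldsymbol{\Sigma}_i + \sigma^2 \mathbf{I})^{-1} + \alpha_{ij} (\tilde{\boldsymbol{\Sigma}}_j + \sigma^2 \mathbf{I})^{-1} - \alpha_{ij} (\tilde{\boldsymbol{\Sigma}}_i + \sigma^2 \mathbf{I})^{-1}. \tag{12}$$

Then the error probability associated with the MMAP classifier 
in (7) can be bounded as follows: 

• If $\boldsymbol{\Sigma}_{ij} \succ 0$ $\forall (i, j)$ with $i \neq j$, then 

$$P(e) \le \bar{P}(e) = \sum_{i=1}^{C} p_i \sum_{j=1,\, j \neq i}^{C} \bar{P}(e_{ij}) \tag{13}$$

where 

$$\bar{P}(e_{ij}) = \left( \frac{\tilde{p}_j}{\tilde{p}_i} \right)^{\alpha_{ij}} \left( \frac{|\tilde{\boldsymbol{\Sigma}}_i + \sigma^2 \mathbf{I}|}{|\tilde{\boldsymbol{\Sigma}}_j + \sigma^2 \mathbf{I}|} \right)^{\alpha_{ij}/2} \left( |\boldsymbol{\Sigma}_i + \sigma^2 \mathbf{I}|\, |\boldsymbol{\Sigma}_{ij}| \right)^{-1/2}. \tag{14}$$

• If $\exists (i, j)$ with $i \neq j$ such that $\boldsymbol{\Sigma}_{ij} \nsucc 0$, then $P(e) \le \bar{P}(e) = 1$. 

Proof: The proof appears in the Appendix. ■ 

(b) Example of correct classification with the MMAP classifier 
Fig. 2. The two plots illustrate the decision regions associated with the 
2-class MMAP classifier for different values of $\mathbf{U}_1$, $\mathbf{U}_2$, $\tilde{\mathbf{U}}_1$ and $\tilde{\mathbf{U}}_2$ in the 
limit $\sigma^2 \to 0$. Transparent blue and red regions indicate the decision regions 
where MMAP outputs class labels 1 and 2, respectively. The blue line represents the 
signal subspace $\operatorname{im}(\mathbf{U}_1)$ and the red line represents the signal subspace $\operatorname{im}(\mathbf{U}_2)$. 
The dashed blue line represents the mismatched signal subspace $\operatorname{im}(\tilde{\mathbf{U}}_1)$. The 
subspace bases are given in Example 2. 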
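
The bound of Theorem 1 can be evaluated numerically. The Python/NumPy sketch below computes one pairwise term under the assumption that it has the closed form obtained by integrating $p(\mathbf{y} \mid c = i)$ against the mismatched likelihood ratio raised to the power $\alpha_{ij}$, namely $(\tilde{p}_j/\tilde{p}_i)^{\alpha_{ij}} (|\tilde{\boldsymbol{\Sigma}}_i + \sigma^2\mathbf{I}| / |\tilde{\boldsymbol{\Sigma}}_j + \sigma^2\mathbf{I}|)^{\alpha_{ij}/2} (|\boldsymbol{\Sigma}_i + \sigma^2\mathbf{I}|\,|\boldsymbol{\Sigma}_{ij}|)^{-1/2}$. The function name `pairwise_bound` and the two-dimensional toy covariances are hypothetical.

```python
import numpy as np

def pairwise_bound(S_i, St_i, St_j, pt_i, pt_j, alpha, sigma2):
    """Pairwise term of the upper bound; returns 1.0 when Sigma_ij is not positive definite."""
    I = np.eye(S_i.shape[0])
    A = S_i + sigma2 * I            # true covariance of y given class i
    B = St_j + sigma2 * I           # mismatched covariance of y given class j
    C = St_i + sigma2 * I           # mismatched covariance of y given class i
    S_ij = np.linalg.inv(A) + alpha * np.linalg.inv(B) - alpha * np.linalg.inv(C)
    if np.min(np.linalg.eigvalsh(S_ij)) <= 0:
        return 1.0
    logdet = lambda M: np.linalg.slogdet(M)[1]
    log_term = (alpha * np.log(pt_j / pt_i)
                + 0.5 * alpha * (logdet(C) - logdet(B))
                - 0.5 * (logdet(A) + logdet(S_ij)))
    return float(np.exp(log_term))

# Hypothetical toy case in R^2: class i lives on e_1, mismatched class j on e_2.
S_i = np.diag([1.0, 0.0])
St_i, St_j = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
b_hi = pairwise_bound(S_i, St_i, St_j, 0.5, 0.5, alpha=0.5, sigma2=1e-2)
b_lo = pairwise_bound(S_i, St_i, St_j, 0.5, 0.5, alpha=0.5, sigma2=1e-4)
```

In this toy case the term shrinks roughly as $(\sigma^2)^{1/2}$ when the noise variance is reduced by a factor of 100, i.e. by about one order of magnitude.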

This upper bound to the error probability of the MMAP 
classifier can capture the fact that the error probability may 
tend to zero as the noise power approaches zero, depending 
on the relation between the true signal parameters and the 
mismatched ones. In particular, the upper bound to the 
misclassification probability of class $i$ is expressed as a function 
of the covariance matrix of class $i$, the mismatched covariance 
matrix of class $i$ and the mismatched covariance matrices of 
classes $j \neq i$. In contrast, the bound proposed in [23] expresses 
the upper bound to the error probability as a function of the 
sum of f-divergences between the true and the mismatched 
distributions of class $i$, for all classes $i$. Therefore, it does not 
capture the interplay between mismatched models of different 
classes. In addition, when specialized to the proposed signal 
model, the bound in [23] always predicts the presence of an 
error floor (see Section IV). 
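The following theorem presents a low-noise expansion 
of the upper bound to the error probability of the MMAP 
classifier. 

Theorem 2: The upper bound to the error probability of the 
MMAP classifier in (13) can be expanded as follows: 

• Assume that $\forall (i, j)$, $i \neq j$, the following conditions hold: 

$$\operatorname{im}(\mathbf{W}_{ij}) \subseteq \operatorname{im}(\tilde{\mathbf{U}}'_{ji})^{\perp}, \tag{15}$$

$$\mathbf{V}_{ij}^T \left( \tilde{\mathbf{U}}'_{ij} (\tilde{\mathbf{U}}'_{ij})^T - \tilde{\mathbf{U}}'_{ji} (\tilde{\mathbf{U}}'_{ji})^T \right) \mathbf{V}_{ij} \succ 0 \quad \text{or} \quad s^V_{ij} = 0, \tag{16}$$

and take $d = \min_{(i,j),\, i \neq j} d_{ij}$, where 

$$d_{ij} = \tfrac{1}{2} \left( s^V_{ij} + \alpha_{ij} (\tilde{r}_j - \tilde{r}_i) \right), \tag{17}$$

and $\alpha_{ij} \in (0, \alpha^0_{ij})$, where the value of $\alpha^0_{ij} > 0$ is given 
in the Appendix. Then 

- If $d \le 0$: 

$$\bar{P}(e) = O(1), \quad \sigma^2 \to 0. \tag{18}$$

- If $d > 0$: 

$$\bar{P}(e) = A \cdot (\sigma^2)^d + o\left( (\sigma^2)^d \right), \quad \sigma^2 \to 0, \tag{19}$$

where $A > 0$. 

• Assume $\exists (i, j)$, $i \neq j$, such that conditions (15) or (16) 
do not hold. Then 

$$\bar{P}(e) = O(1), \quad \sigma^2 \to 0. \tag{20}$$

Proof: The proof appears in the Appendix. ■ 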
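
The decay exponent of Theorem 2 depends only on subspace dimensions. The Python sketch below evaluates it assuming the form $d_{ij} = \tfrac{1}{2}(s^V_{ij} + \alpha_{ij}(\tilde{r}_j - \tilde{r}_i))$ and $d = \min_{i \neq j} d_{ij}$; the function name `decay_exponent`, the nested-list layout of the inputs and the numerical values are hypothetical.

```python
def decay_exponent(s_v, r_mm, alpha):
    """d = min over ordered pairs i != j of 0.5 * (s_v[i][j] + alpha[i][j] * (r_mm[j] - r_mm[i]))."""
    C = len(r_mm)
    return min(0.5 * (s_v[i][j] + alpha[i][j] * (r_mm[j] - r_mm[i]))
               for i in range(C) for j in range(C) if i != j)

alpha = [[0.0, 0.1], [0.1, 0.0]]                       # illustrative values of alpha_ij
# Equal mismatched ranks and s^V_ij = 1 for both ordered pairs.
d_equal = decay_exponent([[0, 1], [1, 0]], [2, 2], alpha)
# Mismatched rank of class 0 exceeds that of class 1 while s^V_01 = 0.
d_floor = decay_exponent([[0, 0], [1, 0]], [3, 2], alpha)
```

The second case yields a negative exponent, which corresponds to the regime in which the upper bound does not vanish as the noise variance tends to zero.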

The expansion of the upper bound to the error probability 
embodied in Theorem 2 provides a set of conditions, which 
are a function of the geometry of the true signal model, the 
geometry of the mismatched signal model, and the interaction 
of the geometries, that enable us to understand whether or 
not the upper bound to the error probability may exhibit an 
error floor. In particular, in view of the fact that we use the 
union bound in order to bound the error probability of a 
multi-class problem in terms of the error probabilities of two-class 
problems, these conditions have to hold for every pair of class 
labels $(i, j)$, $i \neq j$. We can note that: 

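• The upper bound to the probability of error exhibits an 
error floor if either (15) or (16) is not satisfied for some 
pair $(i, j)$, $i \neq j$. The interpretation of condition (15) 
is straightforward by noting that the subspace $\operatorname{im}(\mathbf{W}_{ij})$ 
contains vectors of class $i$ that are orthogonal to the 
subspace $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$, which is the subspace uniquely associated 
with class $i$. Then, condition (15) states that 
such vectors must also be orthogonal to the mismatched 
subspace uniquely associated with class $j$, i.e. $\operatorname{im}(\tilde{\mathbf{U}}'_{ji})$. 
The interpretation of condition (16) is obtained by reformulating 
the expression as: 

$$\mathbf{V}_{ij}^T \left( \tilde{\mathbf{U}}'_{ij} (\tilde{\mathbf{U}}'_{ij})^T - \tilde{\mathbf{U}}'_{ji} (\tilde{\mathbf{U}}'_{ji})^T \right) \mathbf{V}_{ij} \succ 0 \tag{21}$$

$$\mathbf{x}^T \tilde{\mathbf{U}}'_{ij} (\tilde{\mathbf{U}}'_{ij})^T \mathbf{x} > \mathbf{x}^T \tilde{\mathbf{U}}'_{ji} (\tilde{\mathbf{U}}'_{ji})^T \mathbf{x} \quad \forall \mathbf{x} \in \operatorname{im}(\mathbf{V}_{ij}),\ \mathbf{x} \neq \mathbf{0}$$

$$\| (\tilde{\mathbf{U}}'_{ij})^T \mathbf{x} \|_2 > \| (\tilde{\mathbf{U}}'_{ji})^T \mathbf{x} \|_2 \quad \forall \mathbf{x} \in \operatorname{im}(\mathbf{V}_{ij}),\ \mathbf{x} \neq \mathbf{0}.$$

Note that $\| (\tilde{\mathbf{U}}'_{ij})^T \mathbf{x} \|_2 = \| \tilde{\mathbf{U}}'_{ij} (\tilde{\mathbf{U}}'_{ij})^T \mathbf{x} \|_2$ is the norm 
of the projection of $\mathbf{x}$ onto $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$. Therefore, (21) 
requires that the norm of vectors in $\operatorname{im}(\mathbf{V}_{ij})$, which are 
associated with class $i$, projected onto $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$, which 
is also associated with class $i$, is greater than the norm 
of vectors in $\operatorname{im}(\mathbf{V}_{ij})$ projected onto $\operatorname{im}(\tilde{\mathbf{U}}'_{ji})$, which is 
associated with class $j$. 

Equation (21) is also implied by 

$$d_{\min}(\mathbf{V}_{ij}, \tilde{\mathbf{U}}'_{ij}) < d_{\max}(\mathbf{V}_{ij}, \tilde{\mathbf{U}}'_{ji}) \tag{22}$$

which requires that the largest principal angle between 
$\operatorname{im}(\mathbf{V}_{ij})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$ is smaller than the smallest principal 
angle between $\operatorname{im}(\mathbf{V}_{ij})$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{ji})$.³ A demonstration 
of this condition is provided by Example 2 in Section II-A. 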

• On the other hand, the upper bound to the probability of 
error does not exhibit an error floor if conditions (15) and 
(16) are satisfied for all pairs $(i, j)$, $i \neq j$, and $d > 0$. In 
particular, necessary and sufficient conditions for $d > 0$ 
depend on the dimension of the various subspaces and 
their relation, i.e. $s^V_{ij} > 0$ for all pairs $(i, j)$ such that 
$\tilde{r}_j - \tilde{r}_i \le 0$ is necessary and sufficient for $d > 0$. For 
example, if the rank of all covariance matrices associated 
with the mismatched model is the same, i.e., if $\tilde{r}_i = \tilde{r}$ for 
$i = 1, \dots, C$, then $s^V_{ij} > 0$ $\forall (i, j)$, $i \neq j$ is necessary 
and sufficient for $d > 0$. Note that a positive value for $s^V_{ij}$ 
indicates that there is at least one vector in $\operatorname{im}(\mathbf{U}_i)$ that is 
not contained in $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp}$, or equivalently, there exists at 
least one vector in $\operatorname{im}(\mathbf{U}_i)$ that has a non-zero projection 
onto $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})$, therefore leading to reliable classification 
of signals from class $i$. 

• Note that the parameters $\alpha_{ij}$ do not play a role in the 
characterization of the necessary and sufficient conditions 
for $d > 0$. In fact, the conditions for $d_{ij} > 0$ do 
not depend on a particular value of $\alpha_{ij}$, provided that 
$\alpha_{ij} \in (0, \alpha^0_{ij})$. 

• Note also that the value of $d$ represents a measure of 
robustness against noise in the low-noise regime, as it 
determines the speed at which the upper bound of the 
error probability decays with $1/\sigma^2$. In particular, higher 
values of $d$ will represent higher robustness against noise 
in the low-noise regime. For example, on assuming $\tilde{r}_i = \tilde{r}$ 
for $i = 1, \dots, C$, we observe that larger values of $s^V_{ij}$ 
correspond to larger values of $d$. Therefore, as expected, 
higher levels of robustness are obtained when the overlap 
between $\operatorname{im}(\mathbf{U}_i)$ and $\operatorname{im}(\tilde{\mathbf{U}}'_{ij})^{\perp}$, i.e. the dimension of 
$\operatorname{im}(\mathbf{W}_{ij})$, is reduced. 

We also discuss how the value of $d_{ij}$ in equation (17) 
relates to the value of $d_{ij}$ for the non-mismatched case.⁴ 
In particular, we assume that $r_i = r_j = \tilde{r}_i = \tilde{r}_j$ and 
that the true and the mismatched covariance matrices are 
diagonal. Then for the non-mismatched case 

$$d_{ij} = \tfrac{1}{2} \left( r_i - \dim(\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\mathbf{U}_j)) \right)$$

and for the mismatched case 

$$d_{ij} = \tfrac{1}{2} \left( r_i - \dim(\operatorname{im}(\mathbf{U}_i) \cap \operatorname{im}(\tilde{\mathbf{U}}_j)) - \dim(\operatorname{im}(\mathbf{U}_i) \cap \ker(\tilde{\mathbf{U}}_i^T) \cap \ker(\tilde{\mathbf{U}}_j^T)) \right).$$

Therefore, in the non-mismatched case $d_{ij}$ is at most 
$r_i/2$ and it decreases as the dimension of the intersection 
of the signal spaces of classes $i$ and $j$ increases. In the 
mismatched case $d_{ij}$ is also at most $r_i/2$, but it decreases 
as the dimension of the intersection of the signal space 
of class $i$ and the mismatched signal space of class $j$ 
increases, and as the dimension of the intersection of 
the signal space of class $i$ and the noise subspace of the 
mismatched classifier, i.e. $\ker(\tilde{\mathbf{U}}_i^T) \cap \ker(\tilde{\mathbf{U}}_j^T)$, increases. It 
can also be easily verified that the value of $d$ for a 
non-mismatched 2-class problem obtained in [27] matches 
the value of $d$ derived via the proposed bound. Note 
that the bound analyzed in [27] is different from the 
bound proposed in this paper and it is only valid for 
non-mismatched models. 

³ The detailed derivation of this statement is reported in the Appendix. 

⁴ Note that our comparison involves upper bounds on the error probabilities 
rather than the actual error probabilities. 

• The constant $A$ in (19) distinguishes the upper bounds 
for different mismatched models with a constant $d$, in the 
low-noise regime, and is determined as the ratio of volumes 
of subspaces associated with true and mismatched 
signal subspaces and their interaction. See the Appendix for 
the detailed expression. 
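
The geometric conditions discussed above can be checked numerically for a given ordered pair of classes. The Python/NumPy sketch below builds bases for the subspaces unique to each mismatched class from the SVD of $\tilde{\mathbf{U}}_i^T \tilde{\mathbf{U}}_j$ (unit singular values identify the intersection), splits $\operatorname{im}(\mathbf{U}_i)$ into $\operatorname{im}(\mathbf{W}_{ij})$ and $\operatorname{im}(\mathbf{V}_{ij})$, and then tests the subspace inclusion and the positive definiteness condition. All function names and tolerances are assumptions, and the inputs are assumed to have orthonormal columns.

```python
import numpy as np

def split_mismatched(Ut_i, Ut_j, tol=1e-10):
    """Bases of the parts of im(Ut_i) and im(Ut_j) outside their intersection."""
    H, d, Jt = np.linalg.svd(Ut_i.T @ Ut_j, full_matrices=True)
    n_cap = int(np.sum(d > 1.0 - tol))          # cosines equal to 1 span the intersection
    return Ut_i @ H[:, n_cap:], Ut_j @ Jt.T[:, n_cap:]

def split_true(U_i, Up_ij, tol=1e-10):
    """Bases W_ij (orthogonal to im(Up_ij)) and V_ij (its complement in im(U_i))."""
    if Up_ij.shape[1] == 0:
        return U_i, U_i[:, :0]
    _, s, Vt = np.linalg.svd(Up_ij.T @ U_i, full_matrices=True)
    rank = int(np.sum(s > tol))
    return U_i @ Vt[rank:].T, U_i @ Vt[:rank].T

def conditions_hold(U_i, Ut_i, Ut_j, tol=1e-10):
    """Return (inclusion condition, definiteness condition) for the ordered pair (i, j)."""
    Up_ij, Up_ji = split_mismatched(Ut_i, Ut_j, tol)
    W, V = split_true(U_i, Up_ij, tol)
    cond_incl = bool(np.linalg.norm(Up_ji.T @ W) < 1e-8)   # im(W_ij) orthogonal to im(Up_ji)
    if V.shape[1] == 0:                                     # s^V_ij = 0: second condition is void
        return cond_incl, True
    G = V.T @ (Up_ij @ Up_ij.T - Up_ji @ Up_ji.T) @ V
    return cond_incl, bool(np.min(np.linalg.eigvalsh(G)) > tol)

E = np.eye(4)
U1, Ut1 = E[:, :3], E[:, :2]                     # true and mismatched bases of class 1
res_a = conditions_hold(U1, Ut1, E[:, [1, 2]])   # mismatched class 2 spanned by e_2, e_3
res_b = conditions_hold(U1, Ut1, E[:, [1, 3]])   # mismatched class 2 spanned by e_2, e_4
```

With the coordinate subspaces of Example 1 the inclusion fails for the first mismatched model of class 2 and holds for the modified one.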

Theorem 2 therefore leads immediately to sufficient conditions 
for reliable classification in the low-noise regime. 

Corollary 1: If 

im(Wij) C imfU'-J 1 - ^ j , (23) 

- U;.(U' ( .) :, ')V,y >- OV(i, j),i^ j (24) 

and sYj > 0V(i,j) such that f 3 — f, < 0 , then 

lim^a^o P(e) = 0. 

Proof: This follows directly from Theorem 1, since 

lim (T 2_ > ,g P(e) = 0 => lim^s^o P(e) = 0. ■ 

Corollary 2: If

$d_{\min}(U_i, \tilde{U}_i) < d_{\max}(U_i, \tilde{U}_j)$ and $s_{ij} > 0$ $\forall (i,j),\ i \neq j$ (25)

then $\lim_{\sigma^2 \to 0} P(e) = 0$.

Proof: The proof appears in Appendix. ■ 

Note that the conditions in Corollary 2 are implied by (hence 
are weaker) the conditions in Corollary 1. 

The conditions for reliable classification are particularly 
simple for the scenario where true and mismatched covariance 
matrices are diagonal. 

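Corollary 3: Assume $\Sigma_i$, $\tilde{\Sigma}_i$, $i = 1, \ldots, C$ are diagonal. If

$\mathrm{im}(W_{ij}) \subseteq \mathrm{im}(\tilde{U}'_{ji})^{\perp} \quad \forall (i,j),\ i \neq j$ (26)

and $s_{ij} > 0$ $\forall (i,j)$ such that $\tilde{r}_j - \tilde{r}_i \le 0$, then

$\lim_{\sigma^2 \to 0} P(e) = 0.$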

Proof: The proof appears in Appendix. ■ 

Note that in the diagonal case the sufficient conditions for perfect
classification reduce to an inclusion of subspaces. Recall
Example 1, where we demonstrate that the signals in
$\mathrm{im}(W_{ij})$ may be associated with class $i$ or with class $j$.
Condition (26) formalizes the intuition that the signals in
$\mathrm{im}(W_{ij})$ must be orthogonal to $\mathrm{im}(\tilde{U}'_{ji})$, which is uniquely
associated with class $j$.
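As an illustration, the diagonal-case check can be sketched in a few lines of Python. The sketch assumes that every subspace is spanned by canonical basis vectors, so that a subspace can be represented by a set of coordinate indices and $\mathrm{im}(\tilde{U}'_{ij})$ becomes the set difference between the supports of the mismatched classes $i$ and $j$. The function name `corollary3_holds` and the set-based representation are choices made for this sketch, not notation from the paper. Requiring $s_{ij} > 0$ whenever the mismatched ranks satisfy $\tilde{r}_j - \tilde{r}_i \le 0$ is also an assumption of the sketch.

```python
from itertools import permutations


def corollary3_holds(true_supports, mismatched_supports):
    """Check the diagonal-case sufficient conditions for every ordered pair (i, j).

    Each subspace is given as a set of coordinate indices, i.e. the indices k
    of the canonical basis vectors e_k that span it (assumption of this sketch).
    """
    classes = range(len(true_supports))
    for i, j in permutations(classes, 2):
        u_i = true_supports[i]
        ut_i, ut_j = mismatched_supports[i], mismatched_supports[j]
        ut_ij = ut_i - ut_j      # mismatched class-i directions not shared with class j
        ut_ji = ut_j - ut_i      # mismatched class-j directions not shared with class i
        w_ij = u_i - ut_ij       # im(U_i) intersected with the complement of im(U~'_ij)
        s_ij = len(u_i & ut_ij)  # dimension of im(V_ij) = r_i - dim im(W_ij)
        if w_ij & ut_ji:         # inclusion (26) is violated
            return False
        if len(ut_j) - len(ut_i) <= 0 and s_ij == 0:
            return False
    return True


# Cases (a) and (b) of the synthetic examples revisited in Section IV.
case_a = corollary3_holds([{1, 2, 3}, {2, 3, 4}], [{1, 2}, {2, 3}])
case_b = corollary3_holds([{1, 2, 3}, {2, 3, 4}], [{1, 2}, {2, 4}])
```

Under these assumptions, `case_a` is `False` (the direction $e_3$ lies in $\mathrm{im}(W_{12})$ but is claimed by the mismatched class 2) and `case_b` is `True`, in agreement with the error floor reported for case (a) and its absence for case (b).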

We finally illustrate how our conditions cast insight onto the
impact of mismatch for a two-class case where the mismatched
subspaces are a rotated version of the true signal subspaces.

Example 3: Consider a two-class classification problem
where $x|c = 1 \sim \mathcal{N}(0, U_1 U_1^T)$ and $x|c = 2 \sim \mathcal{N}(0, U_2 U_2^T)$
and

$\tilde{U}_1 = Q_1 U_1, \quad \tilde{U}_2 = Q_2 U_2,$ (27)

where $Q_1 \in \mathbb{R}^{N \times N}$ and $Q_2 \in \mathbb{R}^{N \times N}$ are orthogonal matrices,
and $s_{12}, s_{21} > 0$.$^{5}$ By defining

$\epsilon_1 = \|I - Q_1\|_2, \quad \epsilon_2 = \|I - Q_2\|_2$ (28)

$\delta_{12} = \max \cos\theta_{12} = \sqrt{1 - d_{\min}^2(U_1, U_2)},$ (29)

it follows that

$1 - \delta_{12} > \epsilon_1 + \epsilon_2 \Rightarrow \lim_{\sigma^2 \to 0} P(e) = 0.$ (30)

The proof is in the Appendix.

This example provides sufficient conditions for reliable
classification in the low-noise regime by relating the degree
of mismatch, measured in terms of the spectral norm of
the matrix $I - Q_i$, $i = 1, 2$, to the minimum principal
angle between subspaces. It states that the larger the minimum
principal angle between the spaces spanned by signals of class
1 and class 2, i.e. the larger $1 - \delta_{12}$, the more robust is
the classifier against mismatch, where the level of mismatch
is measured by $\epsilon_1 + \epsilon_2$. The maximum robustness against
mismatch is obtained when $\delta_{12} = 0$, which means that signals
from class 1 and class 2 are orthogonal.
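For intuition, the condition can be evaluated by hand in $\mathbb{R}^2$, where each signal subspace is a line and each $Q_i$ is a planar rotation. The sketch below is only an illustration under those assumptions: it reads the condition as $1 - \delta_{12} > \epsilon_1 + \epsilon_2$, uses the fact that a planar rotation by an angle $\varphi$ satisfies $\|I - Q\|_2 = 2|\sin(\varphi/2)|$, and takes $\delta_{12}$ as the absolute cosine of the angle between the two lines. The function names are hypothetical, and the side condition $s_{12}, s_{21} > 0$ is not checked.

```python
import math


def mismatch_level(rotation_deg):
    """Spectral norm ||I - Q||_2 for a planar rotation Q by the given angle."""
    return 2.0 * abs(math.sin(math.radians(rotation_deg) / 2.0))


def example3_condition(angle_between_lines_deg, rot1_deg, rot2_deg):
    """Evaluate 1 - delta_12 > eps_1 + eps_2 for two lines in the plane."""
    delta_12 = abs(math.cos(math.radians(angle_between_lines_deg)))
    return 1.0 - delta_12 > mismatch_level(rot1_deg) + mismatch_level(rot2_deg)


# Orthogonal lines tolerate rotations of 10 and 15 degrees ...
orthogonal_ok = example3_condition(90.0, 10.0, 15.0)
# ... whereas lines only 30 degrees apart do not meet the condition.
close_lines_ok = example3_condition(30.0, 10.0, 15.0)
```

Here `orthogonal_ok` is `True` and `close_lines_ok` is `False`. Since the condition is only sufficient, a `False` outcome does not by itself imply an error floor.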

This example also provides a rationale for state-of-the-art
feature extraction mechanisms where the signal classes are
transformed via a linear operator $\Phi$ prior to classification.
In particular, assume that $\Sigma_1$ and $\Sigma_2$ correspond to the
covariances of signals in class 1 and 2 after the transformation
$\Phi$: the example suggests that the operator $\Phi$ should transform
the signal covariances so that $\delta_{12}$ is small (i.e. so that the
signals from class 1 and 2 are close to orthogonal) in order
to create robustness against mismatch. Such an approach is
considered, for example, in [28], where signals are transformed
by a matrix which promotes large principal angles between
the subspaces. Note that the work in [28] is not motivated on
the basis of robustness against mismatch, but rather on intuitive
insight about classification of signals that lie on subspaces.


IV. Numerical Results 

We now show that our conditions for reliable classification
in the low-noise regime are sharp, by revisiting Examples
1 and 2 presented in Section II-A. The model parameters and
results are summarized in Table III.


5 This condition ensures that the mismatched subspaces are not completely
orthogonal to the signal subspaces.






TABLE III

Mismatch examples given in Section II-A

| Case | Model | Theory: $\lim_{\sigma^2 \to 0} \bar{P}(e)$ | Simulation: $\lim_{\sigma^2 \to 0} P(e)$ |
| (a) | $U_1 = [e_1, e_2, e_3]$, $U_2 = [e_2, e_3, e_4]$, $\tilde{U}_1 = [e_1, e_2]$, $\tilde{U}_2 = [e_2, e_3]$ | $> 0$ | $> 0$ |
| (b) | $U_1 = [e_1, e_2, e_3]$, $U_2 = [e_2, e_3, e_4]$, $\tilde{U}_1 = [e_1, e_2]$, $\tilde{U}_2 = [e_2, e_4]$ | $= 0$ | $= 0$ |
| (c) | $U_1 = [0, 1]^T$, $U_2 = [\cos(f), \sin(f)]^T$, $\tilde{U}_1 = [\cos(f), \sin(f)]^T$, $\tilde{U}_2 = U_2$ | $> 0$ | $> 0$ |
| (d) | $U_1 = [0, 1]^T$, $U_2 = [\cos(f), \sin(f)]^T$, $\tilde{U}_1 = [\cos(^), \sin(f)]^T$, $\tilde{U}_2 = U_2$ | $= 0$ | $= 0$ |

Fig. 3 shows the estimated true error probability, which
is obtained from simulation,$^{6}$ the upper bound to the error
probability given in Theorem 1 and the bound proposed in [23]
(using the KL-divergence) as a function of $\sigma^2$. Note that the
proposed upper bound to the error probability and the derived
sufficient conditions give a sharp prediction of an error floor,
and also that the bound proposed in [23] always exhibits an
error floor.

In case (a), condition (15) in Theorem 2 is not satisfied for
$(i,j) = (1,2)$, i.e. $\mathrm{im}(W_{12}) = \mathrm{im}([e_2, e_3]) \not\subseteq \mathrm{im}(\tilde{U}'_{21})^{\perp} =
\mathrm{im}(e_3)^{\perp}$; therefore, via Theorem 2 we conclude that the upper
bound exhibits an error floor. The results in Fig. 3 show that
in this case the true error probability also exhibits an error
floor. In case (b), conditions (15) and (16) are satisfied and
$d > 0$. Therefore, via Theorem 2, the upper bound to the error
probability approaches zero, which also implies that the true
error probability approaches zero, in the low-noise regime.

For cases (c) and (d) the intuition is provided by
Corollary 2, where in the case of one-dimensional subspaces
the concept of principal angles simply reduces to the notion of
angle between two lines. In particular, in case (c) the condition
(25) in Corollary 2 is not satisfied for $(i,j) = (1,2)$, and we
observe an error floor in the true error probability. On the
contrary, in case (d) the conditions (25) in Corollary 2 are
satisfied, which immediately implies perfect classification in
the low-noise regime.

We now explore how different mismatched models affect
the value of $d$. Consider the following 2-class example in $\mathbb{R}^6$
with orthonormal basis vectors $e_i$, $i = 1, \ldots, 6$, where the
signal spaces are:

$U_1 = [e_1, e_2, e_3], \quad U_2 = [e_4, e_5, e_6]$ (31)

and various mismatched signal spaces are:

$\tilde{U}_1 = [e_1], \quad \tilde{U}_2 = [e_4]$ (32)

$\tilde{U}_1 = [e_1, e_2], \quad \tilde{U}_2 = [e_4, e_5]$ (33)

$\tilde{U}_1 = U_1, \quad \tilde{U}_2 = U_2.$ (34)

6 In our simulations, signals are drawn independently from the true
distribution and are classified by the MMAP classifier.

[Figure 4: error probability versus $1/\sigma^2$ [dB], from -10 dB to 90 dB.]

Fig. 4. Black, blue and red lines correspond to the simulated error
probabilities for examples given by (32), (33) and (34), respectively. Dashed
black, blue and red lines correspond to the upper bound given in Theorem 1
for examples given by (32), (33) and (34), respectively.

It is straightforward to verify that the sufficient conditions for
perfect classification given by Theorem 2 hold for all three
pairs of mismatch models (32), (33) and (34). Furthermore,
one can also determine the values of $d$ as 0.5, 1 and 1.5,
where the values of $d$ do not depend on $\alpha_{ij}$, for the mismatched
models given by (32), (33) and (34), respectively. As observed
in Section III, a higher value of $d$ implies a higher robustness
to noise. Simulation results of the true error probability and
the values of the upper bounds as given in Theorem 1 are
plotted in Fig. 4. One can observe that increasing values of
$d$ (associated with the upper bound to the error probability)
correspond to a steeper decrease of the true error probability as
$\sigma^2 \to 0$. Moreover, the values of $d$ obtained via the upper
bound match the values of $d$ obtained from the simulation of
the true error probability for all the examples (32)-(34).
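The decay exponents 0.5, 1 and 1.5 can be reproduced numerically, because for diagonal covariances every determinant in the pairwise bound of Theorem 1 factorizes over coordinates. The following sketch is an illustration only. It assumes equal priors, unit non-zero eigenvalues and a fixed $\alpha_{12} = 0.5$, none of which is specified in the text for this example. It evaluates the pairwise term for $(i,j) = (1,2)$ and estimates $d$ from the log-log slope between two small noise levels. The function names are hypothetical.

```python
import math


def pairwise_bound(true_i, mis_i, mis_j, n, sigma2, alpha=0.5):
    """Evaluate the pairwise upper bound when every covariance is diagonal.

    Subspaces are sets of coordinate indices. All non-zero eigenvalues equal 1
    and the priors are equal; both are assumptions of this sketch.
    """
    log_bound = 0.0
    for k in range(1, n + 1):
        t = (1.0 if k in true_i else 0.0) + sigma2  # true class-i variance
        a = (1.0 if k in mis_i else 0.0) + sigma2   # mismatched class-i variance
        b = (1.0 if k in mis_j else 0.0) + sigma2   # mismatched class-j variance
        s = 1.0 / t + alpha / b - alpha / a         # k-th diagonal entry of Sigma_ij
        if s <= 0.0:
            raise ValueError("Sigma_ij is not positive definite for this alpha")
        log_bound += 0.5 * alpha * (math.log(a) - math.log(b))
        log_bound -= 0.5 * (math.log(t) + math.log(s))
    return math.exp(log_bound)


def decay_exponent(true_i, mis_i, mis_j, n, hi=1e-6, lo=1e-8):
    """Estimate d from the log-log slope of the bound between two noise levels."""
    b_hi = pairwise_bound(true_i, mis_i, mis_j, n, hi)
    b_lo = pairwise_bound(true_i, mis_i, mis_j, n, lo)
    return math.log(b_hi / b_lo) / math.log(hi / lo)


u1 = {1, 2, 3}
d_32 = decay_exponent(u1, {1}, {4}, 6)              # model (32)
d_33 = decay_exponent(u1, {1, 2}, {4, 5}, 6)        # model (33)
d_34 = decay_exponent(u1, {1, 2, 3}, {4, 5, 6}, 6)  # model (34)
```

With these assumptions the three estimates are numerically close to 0.5, 1 and 1.5. By the symmetry of the example the pair $(2,1)$ yields the same slopes, and changing `alpha` within $(0, 1)$ leaves them unchanged.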

V. Applications 

We finally show how our theory can also capture the impact
of mismatch on classification performance in applications
involving real-world data. We consider a motion segmentation
application, where the goal is to segment a video into multiple
rigidly moving objects, and a hand-written digit classification
application. In both tasks we concentrate on a supervised
learning approach, in which we are given a number of labeled
samples, which are used to estimate the model (training set),
and a number of unlabeled samples that we want to classify
(testing set). Our aim is to determine the minimum size of the
training set needed to guarantee reliable classification of the
testing set.

A. Datasets 

For the motion segmentation task we use the Hopkins 155
dataset [29], which consists of video sequences with 2 or 3
motions in each video. The motion segmentation problem is
usually solved by extracting feature points from the video and
tracking their position over different frames. In more detail,
in this application, observation vectors $y$ are obtained by
stacking the coordinate values associated with a given feature
point corresponding to different frames, and the objective of
motion segmentation is that of classifying each feature point
as belonging to one of the moving objects in the video [10].

Theoretical results show that the features points trajectories 
belonging to a given motion lie on approximately 3 dimen¬ 
sional affine space or 4 dimensional linear space GM3- 
We validate that empirically by observing the decay of singular
values of the data matrix associated with a given motion,
which is shown in Fig. 5(a). Note that singular values are
close to zero for singular value indices that are greater than 4.

[Figure 3, panels: (a) $\lim_{\sigma^2 \to 0} P(e) > 0$; (b) $\lim_{\sigma^2 \to 0} P(e) = 0$; (c) $\lim_{\sigma^2 \to 0} P(e) > 0$; (d) $\lim_{\sigma^2 \to 0} P(e) = 0$.]

Fig. 3. Simulation results for the examples in Table III. In all plots, the black line corresponds to the true error probability $P(e)$ obtained via simulation, the
red line corresponds to the proposed upper bound to the error probability $\bar{P}(e)$ given in Theorem 1 and the dashed orange line corresponds to the upper bound
in [23] (with KL-divergence).

For the experiment we consider a video with 3 motions,$^{7}$
where the number of samples of class 1, class 2 and class 3 is 236,
142 and 114, respectively. The video was picked so as to
maximize the number of feature points (samples) for each
motion. The ranks of the true and the mismatched covariances
are always set to 4. We also split the dataset samples randomly
into a training set and a testing set, where the training set
contains $n_{\max} = 90$ samples per class.

For the hand-written digit classification task we use the
MNIST dataset [30], which consists of $28 \times 28$ grey scale
images of hand-written digits between 0 and 9. We obtain
observation vectors $y$ by vectorizing the images.

The decay of singular values associated with the data
matrix of MNIST digits is shown in Fig. 5(b). Note that
the singular values do not approach zero as fast as in the case
of the Hopkins dataset. We can argue that the classes in the
MNIST dataset are only “approximately low-rank”, i.e. the
covariance matrix associated with class $i$ can be expressed
as $\Sigma_i = \bar{\Sigma}_i + \delta I$, where $\bar{\Sigma}_i$ is low-rank and $\delta > 0$ accounts for
the deviation from the perfectly low-rank model. In view of
the presented signal model this can be interpreted as a
classification of signals with low-rank covariance matrix $\bar{\Sigma}_i$ at finite
$\sigma^2 = \delta$. The sufficient conditions for perfect classification in
the case of the “approximately low-rank” model will now predict
what number of training samples is required to achieve the
best possible error rate for the given classification problem.

The ranks of the true and the mismatched covariances are
always set to 20 in the experiments. Such a rank leads to
capturing approximately 90% of the energy of the signals.
The split into training and testing set is provided by the
MNIST dataset, where the training set contains approximately
$n_{\max} = 5000$ samples per class.


B. Methodology 

We obtain the class-conditioned covariance matrices by
retaining only the first $r$ principal components of the
estimated covariances obtained via the maximum likelihood (ML)
estimator$^{8}$ for each class. The covariance matrix associated
with the “true model” of class $i$ is obtained by estimating
the covariance matrix on all available data samples of class $i$,
and the covariance matrices associated with the “mismatched
model” of class $i$ are obtained by estimating the covariance
matrix on $n_i$ data samples of class $i$.

7 Denoted as “1RT2RCR” in the dataset.

8 Note that this is equivalent to computing the empirical covariance matrix.

[Figure 5: normalized singular values versus singular value index; panels (a) Hopkins dataset and (b) MNIST dataset.]

Fig. 5. Normalized singular values of data matrices corresponding to: (a)
motions in the Hopkins dataset and (b) digits in the MNIST dataset. For the
Hopkins dataset only the first 15 out of 58 singular values are shown. For the
MNIST dataset only the first 200 out of 784 ($= 28 \times 28$) singular values are
shown for the first 3 classes.

Results are produced as follows: in each run $n_i$ samples
are drawn at random from the training set for various values
of $n_i$, $i = 1, \ldots, C$, and the signal covariances are estimated.
The error rate of the MMAP classifier is then evaluated on the
testing set. At the same time, we also determine if the sufficient
conditions for perfect classification as in Theorem 2 hold. We
perform 1000 experimental runs with the Hopkins dataset, where
in each run the dataset is split at random into training and testing
sets. We perform 20 experimental runs with the MNIST dataset,
where in each run the draw of the $n_i$ samples from the training
set is random for $i = 1, \ldots, C$.

[Figure 6, panels: (a) $p_p = 0.8$, $n_3 = 70$; (b) $p_p = 0.9$, $n_3 = 70$.]

Fig. 6. Phase transition of true error rate and phase transition given by
the upper bound to the error probability as a function of number of training
samples $n_1$, $n_2$. Black corresponds to an error floor of the true error rate,
white corresponds to reliable classification, and red line denotes the phase
transition predicted via Theorem 2 for a given probability $p_p$.

The particular choice of samples in the training set can lead
to high variability in the mismatched models, especially for
a small number of training samples. Therefore, in the following,
we have chosen to report the results as follows:

• we state that analysis predicts reliable classification if the
sufficient conditions in Theorem 2 hold with probability
$p_p$ over the different experiment runs;

• we also state that simulation predicts reliable classification
if the true error probability is 0 with probability $p_p$
over different experiment runs;

• if the simulated error rate exhibits an error floor we report
the worst case error rate with probability $p_p$: the error
rate that is achieved at least with probability $p_p$ over all
experimental runs.
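The reporting rules above can be made concrete with a short sketch. It assumes that "with probability $p_p$" means "in at least a fraction $p_p$ of the runs" and that the worst case error rate is the smallest error level that at least a fraction $p_p$ of the runs do not exceed. Both readings are interpretations made here, and the function names are hypothetical.

```python
import math


def predicts_reliable(condition_holds_per_run, p_p):
    """True if the sufficient conditions held in at least a fraction p_p of runs."""
    runs = len(condition_holds_per_run)
    # Small tolerance guards against floating-point round-off in p_p * runs.
    return sum(condition_holds_per_run) >= p_p * runs - 1e-12


def worst_case_error(error_per_run, p_p):
    """Smallest error level that at least a fraction p_p of the runs stay below."""
    ordered = sorted(error_per_run)
    index = max(math.ceil(p_p * len(ordered) - 1e-12) - 1, 0)
    return ordered[index]


# Hypothetical outcomes of five runs (illustrative numbers, not measured data).
errors = [0.0, 0.2, 0.0, 0.1, 0.0]
floor_08 = worst_case_error(errors, 0.8)  # 0.1: four of the five runs are at or below it
floor_10 = worst_case_error(errors, 1.0)  # 0.2: the maximum over all runs
```

With $p_p = 1$ the rule reduces to the maximum error over all runs, which is why larger values of $p_p$ give more conservative estimates.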


C. Results 

The results for the Hopkins dataset are reported in Fig. 6.

We observe that the phase transition predicted by analysis
approximates reasonably well the phase transition obeyed by
simulation. In particular, we can use our theory to gauge the
number of training samples required for perfect classification
in the low-noise regime. As expected, we also observe that
a larger value of $p_p$ gives more conservative estimates of
the required training samples. This holds for both simulation
and analysis.

We also observe that identical trends hold for other values
of $n_3$. In particular, for $n_3 < 30$ simulation does not show a
phase transition and likewise analysis does not show a phase
transition either (these experiments are not reported in view
of space limitations). In contrast, for $n_3 > 30$ both simulation
and analysis predict a phase transition in the error probability.

The results for the MNIST dataset are reported in Fig. 7.
Note that the number of training samples per class is the same
for all classes, i.e. $n_i = n$, $i = 1, \ldots, C$.

In contrast to the results with the Hopkins dataset, the error 
rate obtained on the MNIST dataset exhibits an error floor. 


However, we observe that the worst case error rate reduces 
with a higher number of training samples and reaches an 
error floor at sufficiently large number of training samples. We 
also observe that the phase transition obtained via Theorem 
2 predicts reasonably well the number of training samples 
needed to reach the error floor. 

Finally, note that real data are not drawn from Gaussian 
distributions or perfect linear subspaces (the two main ingre¬ 
dients underlying our analysis). Nevertheless, we have shown 
that the derived bound has practical value even when the two 
assumptions do not hold strictly. 

VI. Conclusion 

This paper studies the classification of linear subspaces 
with mismatched classifiers, i.e. classifiers that operate on a 
mismatched version of the signal parameters in lieu of the true 
signal parameters. In particular, we have developed a low- 
noise expansion of an upper bound to the error probability 
of such a mismatched classifier that equips one with a set of 
sufficient conditions - which are a function of the geometry of 
the true signal distributions, the geometry of the mismatched 
signal distributions, and their interplay - in order to understand 
whether it is possible to classify reliably in the presence of 
mismatch in the low-noise regime. 

Such sufficient conditions are shown to be sharp in the 
sense that they can predict the presence (and the absence) 
of a classification error floor both in experiments involving 
synthetic data as well as experiments involving real data. These 
conditions have also been shown to gauge well the number of 
training samples required for reliable classification in a motion 
segmentation application using the Hopkins 155 dataset and a 
hand-written digit classification application using the MNIST 
dataset. 

Overall, we argue that our conditions can also be used 
as a proxy to develop linear feature extraction methods that 
are robust to mismatch. In particular, our study suggests that 
such methods ought to orthogonalize the different classes as 
much as possible in order to tolerate model mismatch. This 
intuition has been pursued in recent state-of-the-art linear 
feature extraction methods. 


Appendix

A. Preliminaries
We introduce additional quantities and Lemmas that are 
useful for the proofs. 

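a) Quantities: We define the projection operators:

$P_i = U_i U_i^T, \quad \tilde{P}_i = \tilde{U}_i \tilde{U}_i^T$ (35)

$\tilde{P}'_{ij} = \tilde{U}'_{ij}(\tilde{U}'_{ij})^T,$ (36)

where $U_i$, $\tilde{U}_i$ and $\tilde{U}'_{ij}$ are given as in Section II-A. In
addition to the bases $U_i$ and $\tilde{U}_i$ for $\mathrm{im}(\Sigma_i)$ and $\mathrm{im}(\tilde{\Sigma}_i)$,
respectively, we also introduce the bases for $\ker(\Sigma_i)$
and $\ker(\tilde{\Sigma}_i)$ as $U_i^{\perp} \in \mathbb{R}^{N \times (N - r_i)}$ and $\tilde{U}_i^{\perp} \in \mathbb{R}^{N \times (N - \tilde{r}_i)}$,
respectively. We define the projection operators onto these
subspaces:

$K_i = U_i^{\perp}(U_i^{\perp})^T, \quad \tilde{K}_i = \tilde{U}_i^{\perp}(\tilde{U}_i^{\perp})^T.$ (37)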
[Figure 7, panels: (a) Classification of 2 digits; (b) Classification of 5 digits; (c) Classification of 10 digits.]

Fig. 7. The worst case error rate and phase transition predicted via Theorem 2 for a given probability $p_p$ are plotted for classification of MNIST digits. Solid
red and blue lines correspond to worst case error rates for $p_p = 0.9$ and $p_p = 1$, respectively. Dashed vertical lines denote the phase transition predicted via
Theorem 2 for $p_p = 0.9$ (red) and $p_p = 1$ (blue).


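We also define

$L_i = U_i\big(\mathrm{diag}(\lambda^i_1, \ldots, \lambda^i_{r_i}) + \sigma^2 I\big)^{-1} U_i^T,$ (38)

$\tilde{L}_i = \tilde{U}_i\big(\mathrm{diag}(\tilde{\lambda}^i_1, \ldots, \tilde{\lambda}^i_{\tilde{r}_i}) + \sigma^2 I\big)^{-1} \tilde{U}_i^T,$ (39)

and write

$\Sigma_{ij} = L_{ij} + \frac{1}{\sigma^2} K_{ij},$ (40)

where $L_{ij} = L_i + \alpha_{ij}\tilde{L}_j - \alpha_{ij}\tilde{L}_i$ and
$K_{ij} = K_i + \alpha_{ij}\tilde{K}_j - \alpha_{ij}\tilde{K}_i$. Note that

$K_{ij} = K_i + \alpha_{ij}\tilde{P}_i - \alpha_{ij}\tilde{P}_j$ (41)

$\phantom{K_{ij}} = K_i + \alpha_{ij}\tilde{P}'_{ij} - \alpha_{ij}\tilde{P}'_{ji}$ (42)

in view of the fact that $\tilde{P}_i + \tilde{K}_i = I$ and $\tilde{P}_j + \tilde{K}_j = I$ and
$\tilde{P}_i - \tilde{P}_j = \tilde{P}'_{ij} - \tilde{P}'_{ji}$. The last equality simply follows from
the definition of $\tilde{P}'_{ij}$ and $\tilde{P}'_{ji}$, and the definitions of $\tilde{U}'_{ij}$, $\tilde{U}'_{ji}$
and $\tilde{U}^{\cap}_{ij}$ given in Section II-A:

$\tilde{P}_i - \tilde{P}_j = \tilde{U}'_{ij}(\tilde{U}'_{ij})^T + \tilde{U}^{\cap}_{ij}(\tilde{U}^{\cap}_{ij})^T - \big(\tilde{U}'_{ji}(\tilde{U}'_{ji})^T + \tilde{U}^{\cap}_{ij}(\tilde{U}^{\cap}_{ij})^T\big)$ (43)

$\phantom{\tilde{P}_i - \tilde{P}_j} = \tilde{P}'_{ij} - \tilde{P}'_{ji}.$ (44)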
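Finally, we present a decomposition of $x \in \mathbb{R}^N$. We write

$x = x_{\parallel} + x_{\perp} = x_V + x_W + x_{\perp},$ (45)

where

$x_{\parallel} = U_i z_{\parallel}$ (46)

$x_{\perp} = U_i^{\perp} z_{\perp}$ (47)

$x_V = V_{ij} z_V$ (48)

$x_W = W_{ij} z_W,$ (49)

for some vectors $z_{\parallel} \in \mathbb{R}^{r_i}$, $z_{\perp} \in \mathbb{R}^{N - r_i}$, $z_V \in \mathbb{R}^{s_{ij}}$ and
$z_W \in \mathbb{R}^{r_i - s_{ij}}$. Note also that $\|x_V\| = \|z_V\|$, $\|x_W\| = \|z_W\|$
and $\|x_{\perp}\| = \|z_{\perp}\|$.
b) Lemmas: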
b) Lemmas: 

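Lemma 1: The following equality holds:

$\mathrm{im}(\tilde{U}'_{ij})^{\perp} = \ker(\tilde{P}'_{ij}) = \ker(\tilde{U}_i) + \big(\mathrm{im}(\tilde{U}_i) \cap \mathrm{im}(\tilde{U}_j)\big).$

Proof: By leveraging the definition of $\tilde{P}'_{ij}$ in (36) we
have

$\ker(\tilde{P}'_{ij}) = (\mathrm{im}(\tilde{P}'_{ij}))^{\perp} = (\mathrm{im}(\tilde{U}'_{ij}))^{\perp} = \mathrm{im}([\tilde{U}_i^{\perp}, \tilde{U}^{\cap}_{ij}])$

$\phantom{\ker(\tilde{P}'_{ij})} = \mathrm{im}(\tilde{U}_i)^{\perp} + \big(\mathrm{im}(\tilde{U}_i) \cap \mathrm{im}(\tilde{U}_j)\big).$ ■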

Lemma 2: The following statement holds: 

^min (V ij i U, j ) < ^max ij 

(50) 


■ (51) 

Proof: First, note that 

- Ujj(Ujj) T )Vjj = (Vjj) T (P'j - 

P ^)Vjj 

Then we write the following 

(Vjjf P'jVjj = ;V,) r U'iU') y V„ 

(52) 

(Vjj) T PFVjj = (Vjj) T Ujj(Ujj) T Vjj. 

(53) 


Note that the singular values of (Vjj) T U'j and (V,;j) T U' !; 
correspond to the cosines of the principal angles between and 
im(Vjj) and im(U'j), and im(Vjj) and im(U'j), respec¬ 
tively. We then consider the SVDs 

(Vjj) T U'jj = HjjDjjJ?}- (54) 

(Vjj) T U'j = HjjDjjjJj (55) 

where the dimensions of matrices Hjj, Hjj, Djj, Djj, Jjj 
and Jjj follow from the dimension of the Vjj, TJ7 ■ and U' ?: 
as shown in (|9}. We can now express ( |5T} as 

u,,l), D/ n;', a HjjDjjDjjHjj. (56) 

It is straightforward to see that ([50} implies ( [51} . ■ 

Lemma 3: The following equalities and inequalities hold:

$x^T L_i x \ge \frac{1}{\lambda^i_1 + 1}\|x_{\parallel}\|^2$ (57)

$x^T(\tilde{L}_j - \tilde{L}_i)x \ge -\frac{1}{\tilde{\lambda}^i_{\tilde{r}_i}}\|x\|^2$ (58)

$x^T K_i x = \|x_{\perp}\|^2.$ (59)

Proof: The inequality in (57) is due to the fact that $x_{\parallel} \in
\mathrm{im}(L_i) = \mathrm{im}(U_i)$ and $\frac{1}{\lambda^i_1 + 1}$ is a lower bound to the minimum
positive eigenvalue of $L_i$. The inequality in (58) is due to the
fact that $\tilde{L}_j$ is positive semidefinite and that $\frac{1}{\tilde{\lambda}^i_{\tilde{r}_i}}$ is an upper
bound for the largest eigenvalue of $\tilde{L}_i$. The equality in (59)
follows from the definition of the projector $K_i$. ■

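Lemma 4: Assume that

$\mathrm{im}(W_{ij}) \subseteq \mathrm{im}(\tilde{U}'_{ji})^{\perp}$ and (60)

$V_{ij}^T\big(\tilde{U}'_{ij}(\tilde{U}'_{ij})^T - \tilde{U}'_{ji}(\tilde{U}'_{ji})^T\big)V_{ij} \succ 0.$ (61)

Denote by $c_0$ the smallest eigenvalue of

$V_{ij}^T\big(\tilde{U}'_{ij}(\tilde{U}'_{ij})^T - \tilde{U}'_{ji}(\tilde{U}'_{ji})^T\big)V_{ij} = V_{ij}^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})V_{ij}.$

Then

$x^T(\tilde{K}_j - \tilde{K}_i)x \ge c_0\|x_V\|^2 - 2\|x_V\|\|x_{\perp}\| - \|x_{\perp}\|^2.$ (62)

Proof: Note that (49) implies $x_W \in \mathrm{im}(W_{ij}) =
\mathrm{im}(\Sigma_i) \cap \ker(\tilde{P}'_{ij})$, and the condition (60) also implies $x_W \in
\ker(\tilde{P}'_{ji}) = \mathrm{im}(\tilde{U}'_{ji})^{\perp}$. Then, we can write

$x^T(\tilde{K}_j - \tilde{K}_i)x = x^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x$ (63)

$\phantom{x^T(\tilde{K}_j - \tilde{K}_i)x} = x_V^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x_V + 2x_V^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x_{\perp} + x_{\perp}^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x_{\perp}$ (64)

and we note that condition (61) implies the lower bound
$x_V^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x_V \ge c_0\|x_V\|^2$. Moreover, all the eigenvalues of
$\tilde{P}'_{ij} - \tilde{P}'_{ji}$ are contained in the interval $[-1, 1]$ [31, Theorem 26],
so that $x_{\perp}^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x_{\perp} \ge -\|x_{\perp}\|^2$, and, on
leveraging the Cauchy-Schwarz inequality, we also have
$2x_V^T(\tilde{P}'_{ij} - \tilde{P}'_{ji})x_{\perp} \ge -2\|x_V\|\|x_{\perp}\|$. ■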


B. Proof of Theorem 1 

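We prove Theorem 1 by using the fact that $u(x) \le \exp(\alpha x)$,
$\forall x$, $\alpha > 0$ and by leveraging the union bound.

Recall from (7) that the error probability associated with
the MMAP classifier can be expressed as

$P(e) = \sum_{i=1}^{C} p_i P(e|c = i)$ (65)

where $P(e|c = i) = P(\hat{c} \neq i|c = i)$ is the error probability
for signals in class $i$. Via the union bound, we can state that

$P(e|c = i) = P(\hat{c} \neq i|c = i) \le \sum_{j \neq i} P(\hat{c} = j|c = i)$ (66)

where

$P(\hat{c} = j|c = i) \le \int_{-\infty}^{\infty} p(y|c = i)\, u\!\left(\log\frac{\tilde{p}_j\, \tilde{p}(y|c = j)}{\tilde{p}_i\, \tilde{p}(y|c = i)}\right) dy.$ (67), (68)

We will denote $P(\hat{c} = j|c = i) = P(e_{ij})$. Now, by letting
$\alpha_{ij} > 0$ $\forall i \neq j$ we can upper bound the step function to
obtain

$P(e_{ij}) \le \int_{-\infty}^{\infty} p(y|c = i) \exp\!\left(\alpha_{ij}\log\frac{\tilde{p}_j\, \tilde{p}(y|c = j)}{\tilde{p}_i\, \tilde{p}(y|c = i)}\right) dy$

$\phantom{P(e_{ij})} = \left(\frac{\tilde{p}_j}{\tilde{p}_i}\sqrt{\frac{|\tilde{\Sigma}_i + \sigma^2 I|}{|\tilde{\Sigma}_j + \sigma^2 I|}}\right)^{\alpha_{ij}} \big((2\pi)^N |\Sigma_i + \sigma^2 I|\big)^{-\frac{1}{2}} \int \exp\!\left(-\frac{1}{2} y^T \Sigma_{ij}\, y\right) dy = \bar{P}(e_{ij}),$ (69)

where we recall

$\Sigma_{ij} = (\Sigma_i + \sigma^2 I)^{-1} + \alpha_{ij}(\tilde{\Sigma}_j + \sigma^2 I)^{-1} - \alpha_{ij}(\tilde{\Sigma}_i + \sigma^2 I)^{-1}.$

If $\Sigma_{ij} \succ 0$ $\forall i \neq j$, then the integral in (69) converges
$\forall i \neq j$. Therefore, we can bound the error probability as
follows:

$P(e) \le \bar{P}(e) = \sum_{i=1}^{C} p_i \sum_{j \neq i} \bar{P}(e_{ij})$ (70)

where

$\bar{P}(e_{ij}) = \left(\frac{\tilde{p}_j}{\tilde{p}_i}\sqrt{\frac{|\tilde{\Sigma}_i + \sigma^2 I|}{|\tilde{\Sigma}_j + \sigma^2 I|}}\right)^{\alpha_{ij}} \big(|\Sigma_i + \sigma^2 I|\,|\Sigma_{ij}|\big)^{-\frac{1}{2}}.$ (71)

If $\exists\, i \neq j : \Sigma_{ij} \nsucc 0$ then the integral in (69) does not
converge. Therefore, we trivially bound the error probability
as $P(e) \le \bar{P}(e) \le 1$.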
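The mechanics of the bound are easiest to see in the scalar case $N = 1$, where the pairwise error probability has a closed form. The sketch below is an illustration under assumptions that are not part of the proof: equal priors, zero-mean scalar Gaussians, and a mismatched class-$j$ variance smaller than the mismatched class-$i$ variance, so that class $j$ is selected when $|y|$ falls below a threshold. The variance arguments already include the noise $\sigma^2$, and the function names are hypothetical.

```python
import math


def exact_pairwise_error(v_true, v_mis_i, v_mis_j):
    """P(classifier prefers j | c = i) for scalar zero-mean Gaussians, equal priors.

    Assumes v_mis_j < v_mis_i, so class j is chosen when |y| is below a threshold.
    """
    if not v_mis_j < v_mis_i:
        raise ValueError("this sketch only covers v_mis_j < v_mis_i")
    threshold_sq = math.log(v_mis_i / v_mis_j) / (1.0 / v_mis_j - 1.0 / v_mis_i)
    # For y ~ N(0, v_true): P(|y| <= t) = erf(t / sqrt(2 * v_true)).
    return math.erf(math.sqrt(threshold_sq / (2.0 * v_true)))


def exponential_bound(v_true, v_mis_i, v_mis_j, alpha):
    """Scalar version of the pairwise bound obtained from u(x) <= exp(alpha * x)."""
    s = 1.0 / v_true + alpha / v_mis_j - alpha / v_mis_i  # scalar Sigma_ij
    if s <= 0.0:
        return 1.0  # the Gaussian integral diverges; use the trivial bound
    return (v_mis_i / v_mis_j) ** (alpha / 2.0) / math.sqrt(v_true * s)


exact = exact_pairwise_error(1.0, 1.2, 0.05)
bounds = [exponential_bound(1.0, 1.2, 0.05, a) for a in (0.25, 0.5, 1.0)]
```

For these illustrative values every entry of `bounds` lies above `exact`, as it must. The bound is valid for any admissible `alpha`, and its tightness depends on that choice.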


C. Proof of Theorem 2 


The proof is presented in two parts. First, we establish sufficient
conditions for $\Sigma_{ij} \succ 0$; second, we establish conditions
for the upper bound to the probability of misclassification to
approach zero as the noise approaches zero.

1) Positive Definiteness of $\Sigma_{ij}$: The following two Lemmas
give sufficient conditions for $\Sigma_{ij} \succ 0$.

Lemma 5: Assume that si > 0, 

L J 


im(Wy) C im(UP) , 

v£(uuu;,) T - uyu'/jVy >- o 




co 


Aj +1 l + co(l + J-)’ 


where cq is the smallest eigenvalue of 


v£(U'.(u If - U'i(U ',) 7 )V >:/ 
= (Vjj) T (P' - P'iV, y . 


Then 


£,,■ >- 0, Vcr 2 £ 0,min ( 1 


1 ttj j 


(72) 

(73) 

(74) 


(75) 
Proof: To show this we first produce a lower bound:
1 


x 1 SyX = x 1 LijX H-yx j K ijx 


(76) 


= x 1 L^x + aijX 1 (Lj — L,)x 


~ (x T K,x + a i ,x T (K i - K 4 )x) (77) 

(78) 


> c T Cc, 

where c = [||x w ||, ||x v ||, ||xjJ] T and 

0 


c = 


A^+l A~. 

o 

0 


1 otjj . cadi 

A i + 1 At ' °‘ 2 


V- 


(79) 

by using the equalities and inequalities (|57|)-(|59|) and ( |62| i. 

Now, by using standard algebraic manipulations, it is pos¬ 
sible to show that the choice ( |74} leads to C >- 0 , hence ( |75| ) 
holds. 


Lemma 6 : Assume that s;, : = 0, 

L J 

im(W,j) C im(UL)- 

( Al 


A’i + 1’ At +1’ 


Then 


T,ij A 0, Vcr 2 £ ( 0,min ( 1, 


1 CHjq 


AJ 


(80) 

(81) 

(82) 


x 1 S,jX > 


110 'Ixll 2 


Proof: We prove the Lemma by constructing the lower 
bound 

1 I, || 2 

ATT 11 *" 1 ' “At 

+4 (ii x ±n 2 +«y xT (Pii -p;,)A (s3 ) 


> 


1 

a ij \ 

Aj + i 

aJ" 

( 1 - 

a ij 




■X_L 


(84) 


x T (K i +a ij (PL-PL))x > 0,Vx £ im([V.y, U 2 -]). Namely, 
by leveraging the equality in ( |59| and inequality in ( |62[ i, we 
can write 


x T (Kj + n„.:P; ; - PL))x > (1 - «ij)||xj_| 

-2 a.i 

+ayCo||x v | 

where xi,xv have been defined in (47 i and (|48|. If 


||x v | 

m2 


c ° then the right hand side of (|85) is always strictly 


(85) 

< 


HO 


co+l - .-. 

positive, unless x = 0. Then, since the condition in (|74J) 

implies a,; ? < we can conclude that K^ A 0 and 

im(Wjj) = ker(Kjj) and im([V.y, U 2 -]) = im(Kjy). There¬ 
fore, rank(Kjy) = rankQVy, U 2- ]) = s} 2 + (TV — rf). 
Assume now that ( |72| ), s} 2 = 0 and ( |8T| > are satisfied. 


In this case x 2 K; 


x = x 


*7 

T 'K i 


i(p« - p;-t))x = 


(Ki +aij(PL - P'y))x_ L > \\x ± f(l -_aij), where we 
have used the fact that eigenvalues of PL — PL contained in 


the interval [—1,1], Since ( |8T| > implies a, : j < 1 we conclude, 
via an argument similar to that in previous paragraph, that 

K ij A 0 and rank(Ky) = rank(UA) = s] 2 + (TV — rf). 

■ 

Lemma 8 : Assume that condition f72[ ) given in Lemma [5] 
holds. Assume also that s] 2 > 0 and ( |73[ i and \1A\ given in 
Lemma [5| hold, or that s} 2 = 0 and ( |8T| given in Lemma |6] 
holds. Tnen, as a 2 —» 0, we can write 


1 


I + ^2 I — Vi 0 ' ( a 2 


1 



r K,.j — 1' 


, (86) 


where r r, = rank(K, 7 ), and i>ij is given as 
•■ij J l v •Jr 


= 


pdet(K y )|(U£..) r L?,Ui. 


-‘ij'-’K, 


IK, 


if ? ’k < TV 
if r K ’ = TV 


L°- = lim 0 


>oLij=L?- 


otij Lj - and 


L° = U i (diag(Ai,...,Ag)- 1 (U i f 
L? = U i (diag(A* 1 ,...,Ai))- 1 (U i ) r . 


(87) 

( 88 ) 

(89) 


by using the inequalities equalities and inequalities (|57]>-(|59|) 
and ( |62| ), and by noting that xv = 0. The choice ( |8T| then 
leads to ( f82] ). ■ 

2) Part 2: Low-noise Expansion: To obtain the low-noise 
expansion of the upper bound to the error probability we first 
present two supporting Lemmas. 

Lemma 7: Assume that condition ( f72[ ) given in Lemma [5] 
holds. Assume also that sj 2 > 0 and and © given in 
Lemma |5] hold, or that sj 2 = 0 and ([87} given in Lemma |6] 
holds. Then Ky A 0 and rank(K, 7 ) = TV + s] 2 — ry. 

Proof: Assume that m *Yj > o m and © are 
satisfied. By definition, im(W,j) = im(Ej) flker(PL) and, 
as a consequence of (72 1 , it also holds im(W u ) C ker(PL), 
which leads to im("W, ? ) C ker(K, 3 ). Moreover, it is straight¬ 
forward to note that imQV.y, U 2 -]) = (im(Wjj)) 1 . Then, 
in order to prove that K (y A 0, we show that x 1 K, y x = 


Proof: Note first that the sufficient conditions imply 
K i;i A 0 via Lemma [7] We can write the eigenvalue decom¬ 
position of K ij\ 


Kjj = U K , 


A k , 

0 


uL, 


(90) 


where Ur 


TxN 


is orthogonal and Ar. = 


K K 

diag(A 1 ij ,..., Kff ) contains the positive eigenvalues of 
K i:j , with r Kij = rank(K ii7 -). 

Now, we can write 


1 


a 

T T 


2 K«| = 


^ A K !: 

0 


0 

0 


E 


(91) 


where E = l L,,Ur, . We also denote by E^...^ 
the principal submatrix of order TV — m obtained by delet¬ 
ing the rows and the columns of the matrix E. 

; P^ i EPj 1 ...j m , where the matrix 
is obtained by picking all the columns 


Note that E^..., 


IxN—m 
from the identity matrix with the column indices different
from $i_1, \ldots, i_m$. Then, the Poincaré separation theorem [32,
Corollary 4.3.37] guarantees that the eigenvalues of $E_{i_1 \ldots i_m}$ are
bounded by the minimum and the maximum eigenvalues of $E$,
which correspond to the minimum and maximum eigenvalues
of $L_{ij}$. Moreover, as $\sigma^2 \to 0$, while the diagonal elements
of $\frac{1}{\sigma^2}\Lambda_{K_{ij}}$ grow unbounded, the eigenvalues of $L_{ij}$, and
therefore also the determinant of $E_{i_1 \ldots i_m}$, are bounded.


Then, we can use the determinant decomposition in 


r k, 


Theorem 2.3] to express 

= N: 


a 2 I 1^-2 I 


T ^ 2 Ky| as follows. If 

+ Si + ... + Sn-i j 

(92) 


where 

Sm=J2 


l<ii <...<i m <iV 




E, 


1 < m < N - 1 

(93) 


and the summation is over all possible ordered subsets of m 
indices from the set {!,..., r'K.,. ( }• Otherwise, if tk,, < N: 


-‘ij + — 1^9'I 


Si 


Sr K„, (94) 


where 

s m =J2 


l<ii <...<i m <r Kij 


(^) 


IE.,; 


CF 2A im ) 1-^*1 —. 

1 < m < r'K. : , • 

(95) 


Now we show that ([87]) holds. We first assume rank(Ky) = 
N and take the right hand side of ( |92| ) and multiply it by 
(^) rK H° 2 Y Kii to get 

+ (°’ 2 ) + Si + . . . + Sn- l)) ■ 

(96) 


Note now that for all S m , m = — 

lim CT 2 _ ) ,o((T 2 ) rK o S m = 0 and lim (7 2 _ ) . 0 (cr 2 ) , ' K « |Ly | = 0. 
Therefore, ( |87[ holds for the case rank(Ky) = N. To show 
the derivation of Vij for the case rank(Ky) < N we use the 
same technique where we multiply by (4 2 -) rKii (<r 2 ) rKy the 
right hand side of to get 

(tr 2 ) K,J (! I-,, + Si +... + S rK i j j . 

(97) 

As a 2 —j 0 we can write (<7 2 ) rKij S ric .. = 

pdet(Ky)|(U^-..) T L° J U^ : . |. This concludes the derivation 
of ( [87[ . Note also that > 0, since the pseudo-determinant 
and the determinants in (|87[ are greater than zero. 


We now provide the low-noise expansion of the upper bound 
to the probability of misclassification. 


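Assume that the sufficient conditions for positive definiteness
of $\Sigma_{ij}$, $\forall i \neq j$ do not hold. Then, the upper bound to the
probability of error is chosen to be $\bar{P}(e) = 1$, so that in general
it does not tend to zero as $\sigma^2$ tends to zero.

Assume now that the sufficient conditions for $\Sigma_{ij} \succ 0$ as
given in the first part of the proof hold $\forall i \neq j$. Then, the upper
bound to the probability of misclassification can be written as
follows:$^{9}$

$\bar{P}(e) = \sum_{i=1}^{C} p_i \sum_{j \neq i} \left(\frac{\tilde{p}_j}{\tilde{p}_i}\sqrt{\frac{|\tilde{\Sigma}_i + \sigma^2 I|}{|\tilde{\Sigma}_j + \sigma^2 I|}}\right)^{\alpha_{ij}} \big(|\Sigma_i + \sigma^2 I|\,|\Sigma_{ij}|\big)^{-\frac{1}{2}}.$ (98)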


We will now produce a low-noise expansion of ( [98] ) in order to 
understand whether or not lim cr 2 ^ 0 P{e) - 0. The following 
low-noise expansions are trivial: 


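$$|\Sigma_i + \sigma^2 I| = \left( \prod_{k=1}^{r_i} (\lambda_{i,k} + \sigma^2) \right) (\sigma^2)^{N - r_i} = v_i (\sigma^2)^{N - r_i} + o\left( (\sigma^2)^{N - r_i} \right), \quad \sigma^2 \to 0, \tag{99}$$

$$|\tilde{\Sigma}_i + \sigma^2 I| = \left( \prod_{k=1}^{\tilde{r}_i} (\tilde{\lambda}_{i,k} + \sigma^2) \right) (\sigma^2)^{N - \tilde{r}_i} = \tilde{v}_i (\sigma^2)^{N - \tilde{r}_i} + o\left( (\sigma^2)^{N - \tilde{r}_i} \right), \quad \sigma^2 \to 0. \tag{100}$$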
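As a quick numerical illustration of (99), the sketch below compares the exact determinant with its leading low-noise term. It assumes the covariance is described only through its positive eigenvalues (which is all the determinant depends on); the function names are hypothetical.

```python
import math


def det_sigma_plus_noise(pos_eigs, n, sigma2):
    """|Sigma + sigma^2 I| for an n x n PSD Sigma with positive eigenvalues pos_eigs."""
    r = len(pos_eigs)
    return math.prod(lam + sigma2 for lam in pos_eigs) * sigma2 ** (n - r)


def leading_term(pos_eigs, n, sigma2):
    """pdet(Sigma) * (sigma^2)^(n - r): the first-order low-noise approximation."""
    return math.prod(pos_eigs) * sigma2 ** (n - len(pos_eigs))


def ratio(pos_eigs, n, sigma2):
    """Exact determinant divided by its leading term; tends to 1 as sigma2 -> 0."""
    return det_sigma_plus_noise(pos_eigs, n, sigma2) / leading_term(pos_eigs, n, sigma2)
```

For example, `ratio([2.0, 0.5], 5, sigma2)` approaches 1 as `sigma2` shrinks, which is what the $o(\cdot)$ term in the expansion expresses.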



The low-noise expansion of $|\Sigma_{ij}|$ is more involved and is provided in Lemma 8.

Then, it follows immediately that the low-noise expansion of each term in the upper bound to the probability of error in (98) is given by

$$A_{ij} (\sigma^2)^{d_{ij}} + o\left( (\sigma^2)^{d_{ij}} \right), \tag{101}$$

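where

$$d_{ij} = \frac{\alpha_{ij}}{2} \left( (N - \tilde{r}_i) - (N - \tilde{r}_j) \right) - \frac{1}{2} (N - r_i) - \frac{1}{2} \left( -\mathrm{rank}(K_{ij}) \right) = \frac{1}{2} \left( \alpha_{ij} (\tilde{r}_j - \tilde{r}_i) + s_{ij} \right), \tag{102}$$

$$A_{ij} = p_i \left( \frac{\tilde{p}_j}{\tilde{p}_i} \right)^{\alpha_{ij}} \left( \frac{\tilde{v}_i}{\tilde{v}_j} \right)^{\frac{\alpha_{ij}}{2}} \left( v_i \, v_{ij} \right)^{-\frac{1}{2}}, \tag{103}$$

and

$$v_i = \mathrm{pdet}(\Sigma_i), \quad \tilde{v}_i = \mathrm{pdet}(\tilde{\Sigma}_i). \tag{104}$$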


It follows immediately that the low-noise expansion of the upper bound to the probability of error in (98) is given by

$$\bar{P}(e) = A (\sigma^2)^{d} + o\left( (\sigma^2)^{d} \right), \tag{105}$$

where $d = \min_{(i,j)} d_{ij}$ and $A = \sum_{(i,j) \in S_d} A_{ij}$, with $S_d = \{ (i,j) : d_{ij} = d \}$.
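The step from the per-pair expansions to (105) amounts to keeping only the terms with the smallest exponent. A minimal sketch of that bookkeeping follows; the dictionary layout, the function names and the numerical values of $A_{ij}$ and $d_{ij}$ are made up for illustration and are not taken from the paper.

```python
def upper_bound(terms, sigma2):
    """Sum of the leading terms A_ij * (sigma^2)^d_ij over all class pairs."""
    return sum(a * sigma2 ** d for a, d in terms.values())


def dominant_term(terms, tol=1e-12):
    """Return (A, d, S_d) with d = min d_ij and A the sum of A_ij over S_d.

    `terms` maps a pair (i, j) to a tuple (A_ij, d_ij).
    """
    d = min(d_ij for _, d_ij in terms.values())
    s_d = sorted(pair for pair, (_, d_ij) in terms.items() if abs(d_ij - d) <= tol)
    a = sum(terms[pair][0] for pair in s_d)
    return a, d, s_d


# Hypothetical values chosen only to exercise the functions.
TERMS = {(1, 2): (0.3, 1.0), (2, 1): (0.2, 0.5), (1, 3): (0.4, 0.5)}
```

With these values the pairs $(2,1)$ and $(1,3)$ share the smallest exponent, so they alone determine $A$ and $d$; the remaining pair contributes a term of higher order in $\sigma^2$. A strictly positive $d$ means the leading term vanishes as $\sigma^2 \to 0$.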


9 Note that a value of $\alpha_{ij}$ that satisfies the conditions for $\Sigma_{ij} \succ 0$ always exists and therefore does not affect the derivation of the low-noise expansion.
D. Proof of Corollary 2

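Assume $s_{ij} > 0$ and

$$d_{\min}(U_i, \tilde{U}_i) < d_{\max}(U_i, \tilde{U}_j), \quad \forall (i,j),\ i \neq j. \tag{106}$$

Note that $d_{\min}(U_i, \tilde{U}_i) < d_{\max}(U_i, \tilde{U}_j)$ implies

$$U_i^T \left( \tilde{U}_i \tilde{U}_i^T - \tilde{U}_j \tilde{U}_j^T \right) U_i \succ 0 \tag{107}$$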

Uf - %(%) T )V t >- 0, (108) 

where we have used the result of Lemma 2 in Appendix A.

By taking $x \in W_{ij}$ or $x \in V_{ij}$ it is straightforward to show that (108) implies (15) and (16), thus obtaining conditions identical to those in Corollary 1.

E. Proof of Corollary 3 

We prove the corollary by showing that in the diagonal case (16) always holds. Note first that

im(Vjj-) = im(Uj) n (im(Uj) n im(U' ii )- L )' L 
= im(Ui) ("I im(U,y) C im(U^). 

It is also straightforward to establish that (16) holds if and
only if im(V,j) C imCtJC)- 1 -, and this always holds since 
im(Vij) C im(UL and im(UC C im(U' i )- L . 

F. Derivation of Example [77] 

We prove statement (30) by showing that

$$1 - \delta_{12} > \epsilon_1 + \epsilon_2 \tag{109}$$

together with $s_{12}, s_{21} > 0$ implies the sufficient conditions for perfect classification in Corollary 2.

Assume $U_i$ and $U_j$ are given and the singular values of $(U_i)^T U_j$ are known. We also know that $\tilde{U}_j = Q_j U_j$. We can write

$$(U_i)^T \tilde{U}_j = (U_i)^T U_j + (U_i)^T (Q_j - I) U_j. \tag{110}$$

By leveraging [34, Theorem 1], we can state that the $k$-th singular value $\tilde{d}_k$ of $(U_i)^T \tilde{U}_j$ lies in the interval $\left[ d_k - \|(U_i)^T (Q_j - I) U_j\|_2,\ d_k + \|(U_i)^T (Q_j - I) U_j\|_2 \right]$, where $d_k$ is the $k$-th singular value of $(U_i)^T U_j$. Then, we can write the upper bound

$$\|(U_i)^T (Q_j - I) U_j\|_2 \le \|Q_j - I\|_2 = \epsilon_j, \tag{111}$$

where the inequality follows from the SVD separation theorem [35, Theorem 2.2]. Note also that

$$(U_i)^T \tilde{U}_i = I + (U_i)^T (Q_i - I) U_i, \tag{112}$$

where the singular values of $(U_i)^T \tilde{U}_i$ are bounded by $1 \pm \|(U_i)^T (Q_i - I) U_i\|_2$. By leveraging (111) we can further bound the singular values as $1 \pm \epsilon_i$.
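The perturbation bound behind (110) and (111) can be illustrated on a small case. The sketch below is an assumption-laden toy example, not the paper's setup: it uses two-dimensional subspaces of $\mathbb{R}^3$, takes $Q_j$ to be a rotation by `theta` in the $xz$-plane (for which $\|Q_j - I\|_2 = 2|\sin(\theta/2)|$), and computes singular values of the resulting $2 \times 2$ matrices in closed form. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def singular_values_2x2(m):
    """Closed-form singular values of a 2x2 matrix, in descending order."""
    t = sum(v * v for row in m for v in row)        # squared Frobenius norm
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]       # determinant
    disc = math.sqrt(max(t * t - 4.0 * d * d, 0.0))
    return [math.sqrt((t + disc) / 2.0), math.sqrt(max((t - disc) / 2.0, 0.0))]


def check_perturbation(phi, theta):
    """Return (d, d_tilde, eps) for a toy pair of planes in R^3."""
    s = 1.0 / math.sqrt(2.0)
    u_i = [[s, 0.0], [s, 0.0], [0.0, 1.0]]                       # orthonormal columns
    u_j = [[math.cos(phi), 0.0], [0.0, 1.0], [math.sin(phi), 0.0]]
    c, sn = math.cos(theta), math.sin(theta)
    q_j = [[c, 0.0, -sn], [0.0, 1.0, 0.0], [sn, 0.0, c]]         # rotation in the xz-plane
    d = singular_values_2x2(matmul(transpose(u_i), u_j))
    d_tilde = singular_values_2x2(matmul(transpose(u_i), matmul(q_j, u_j)))
    eps = 2.0 * abs(math.sin(theta / 2.0))                       # ||Q_j - I||_2
    return d, d_tilde, eps
```

Calling `check_perturbation(0.9, 0.3)` returns singular values whose pairwise differences stay below `eps`, in line with the interval stated above.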

Note now that $1 - \delta_{12} > \epsilon_1 + \epsilon_2$ if and only if $1 - \epsilon_1 > \delta_{12} + \epsilon_2$, which implies

$$d_{\min}(U_1, \tilde{U}_1) < d_{\max}(U_1, \tilde{U}_2), \tag{113}$$

which is in turn equivalent to

$$\max_k \cos_k\left( (U_1)^T \tilde{U}_2 \right) < \min_l \cos_l\left( (U_1)^T \tilde{U}_1 \right), \tag{114}$$

where $\max_k \cos_k((U_1)^T \tilde{U}_2)$ denotes the cosine of the smallest principal angle between $\mathrm{im}(U_1)$ and $\mathrm{im}(\tilde{U}_2)$, and $\min_l \cos_l((U_1)^T \tilde{U}_1)$ denotes the cosine of the largest principal angle between $\mathrm{im}(U_1)$ and $\mathrm{im}(\tilde{U}_1)$. The equivalence between (113) and (114) follows directly from the definitions of the min and max correlation distances. It is now easy to verify that $1 - \epsilon_1 > \delta_{12} + \epsilon_2$ implies (114), since $1 - \epsilon_1$ is a lower bound on the cosine of the largest principal angle between $U_1$ and $\tilde{U}_1$, and $\delta_{12} + \epsilon_2$ is an upper bound on the cosine of the smallest principal angle between $U_1$ and $\tilde{U}_2$.
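To make the chain of bounds concrete, the following sketch works through the simplest possible instance: two lines in the plane, so that each pair of subspaces has a single principal angle (smallest and largest coincide). The mismatched lines are assumed to be obtained by planar rotations with angles `theta1` and `theta2`, for which $\epsilon = 2|\sin(\theta/2)|$. This setup and the function name are illustrative assumptions, not the paper's example.

```python
import math


def check_example(phi, theta1, theta2):
    """Toy check with U_1 = span{(1, 0)} and U_2 = span{(cos phi, sin phi)}."""
    delta12 = abs(math.cos(phi))                 # cosine of the angle between U_1 and U_2
    eps1 = 2.0 * abs(math.sin(theta1 / 2.0))     # ||Q_1 - I||_2 for a planar rotation
    eps2 = 2.0 * abs(math.sin(theta2 / 2.0))     # ||Q_2 - I||_2 for a planar rotation
    cos_11 = abs(math.cos(theta1))               # between U_1 and the mismatched U~_1
    cos_12 = abs(math.cos(phi + theta2))         # between U_1 and the mismatched U~_2
    return {
        "sufficient": 1.0 - eps1 > delta12 + eps2,   # condition 1 - eps1 > delta12 + eps2
        "condition_114": cos_12 < cos_11,            # analogue of (114)
        "lower_bound_ok": cos_11 >= 1.0 - eps1 - 1e-12,
        "upper_bound_ok": cos_12 <= delta12 + eps2 + 1e-12,
    }
```

For instance, `check_example(1.2, 0.1, -0.1)` satisfies the sufficient condition, and the two bounds then force the analogue of (114) to hold as well.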

Finally, the same arguments can be used to show that $d_{\min}(U_2, \tilde{U}_2) < d_{\max}(U_2, \tilde{U}_1)$. This concludes the proof.